In our final group assignment, we will analyze data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an Airbnb in a city. We have chosen Amsterdam for our analysis.
Our prediction is that the average price of a 4-night stays for 2 people in a room rate 4.98 or above in Amsterdam is 404 USD. However, our model only explains about 50% of price variation and this may estimate may be as low as 200 USD and as high as 815 USD.
The following variables that can be used to determine the prices impact the price positively:
The following variables that can be used to determine the prices impact the price negatively:
The following variables were not used because they didn’t seem relevant for our models: - The property types and number of beds capture the same information as room type and the number of bedrooms. - The presence of a license is not relevant. - Last reviews, compared to the average rating, don’t give additional information.
Next potential steps: Using these variables, we were able to explain up to 50% of the price differences. Other variables such as the photos, number of visits, the host presentation can help explain more accurately the price. Areas for further exploration include exploring different cities, seasons in terms of price.
In our final group assignment we are going to analyse data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an AirBnB in a city. We have chosen Amsterdam for our analysis.
Our AirBnB data come from insideairbnb.com; it was originally scraped from airbnb.com.
Below we load the data from Airbnb:
Let’s first get an overview of our data.
Afterward we will wrangle the data so that we can work with it in our models and plot some of the key relationships.
We use glimpse and skim to gain a general view of the data. We have 74 variables and 16,116 observations. Variables include 24 character variables, 5 date variables, 8 logical variables, and 37 numeric variables. We also have some missing values throughout multiple of the variables in our data.
#Overview of Data
glimpse(listings)Rows: 16,116
Columns: 74
$ id <dbl> 2818, 20168, 25428, 27886…
$ listing_url <chr> "https://www.airbnb.com/r…
$ scrape_id <dbl> 2.021091e+13, 2.021091e+1…
$ last_scraped <date> 2021-09-07, 2021-09-07, …
$ name <chr> "Quiet Garden View Room &…
$ description <chr> "Quiet Garden View Room &…
$ neighborhood_overview <chr> "Indische Buurt (\"Indies…
$ picture_url <chr> "https://a0.muscache.com/…
$ host_id <dbl> 3159, 59484, 56142, 97647…
$ host_url <chr> "https://www.airbnb.com/u…
$ host_name <chr> "Daniel", "Alexander", "J…
$ host_since <date> 2008-09-24, 2009-12-02, …
$ host_location <chr> "Amsterdam, Noord-Holland…
$ host_about <chr> "Upon arriving in Amsterd…
$ host_response_time <chr> "within an hour", "within…
$ host_response_rate <chr> "100%", "100%", "N/A", "8…
$ host_acceptance_rate <chr> "100%", "100%", "0%", "10…
$ host_is_superhost <lgl> TRUE, FALSE, TRUE, TRUE, …
$ host_thumbnail_url <chr> "https://a0.muscache.com/…
$ host_picture_url <chr> "https://a0.muscache.com/…
$ host_neighbourhood <chr> "Indische Buurt", "Gracht…
$ host_listings_count <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_total_listings_count <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_verifications <chr> "['email', 'phone', 'revi…
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified <lgl> TRUE, TRUE, FALSE, TRUE, …
$ neighbourhood <chr> "Amsterdam, North Holland…
$ neighbourhood_cleansed <chr> "Oostelijk Havengebied - …
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N…
$ latitude <dbl> 52.36435, 52.36407, 52.37…
$ longitude <dbl> 4.94358, 4.89393, 4.88487…
$ property_type <chr> "Private room in rental u…
$ room_type <chr> "Private room", "Private …
$ accommodates <dbl> 2, 2, 3, 2, 2, 1, 2, 3, 2…
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text <chr> "1.5 shared baths", "1 pr…
$ bedrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ beds <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2…
$ amenities <chr> "[\"Single level home\", …
$ price <chr> "$59.00", "$106.00", "$12…
$ minimum_nights <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ maximum_nights <dbl> 28, 365, 120, 730, 1825, …
$ minimum_minimum_nights <dbl> 3, 1, 7, 2, 2, 2, 90, 4, …
$ maximum_minimum_nights <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ minimum_maximum_nights <dbl> 1125, 1125, 120, 1125, 18…
$ maximum_maximum_nights <dbl> 1125, 1125, 120, 1125, 18…
$ minimum_nights_avg_ntm <dbl> 3.0, 1.0, 13.8, 2.0, 2.0,…
$ maximum_nights_avg_ntm <dbl> 1125, 1125, 120, 1125, 18…
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30 <dbl> 3, 0, 1, 9, 11, 8, 4, 0, …
$ availability_60 <dbl> 28, 0, 1, 20, 27, 23, 34,…
$ availability_90 <dbl> 55, 0, 3, 47, 50, 50, 64,…
$ availability_365 <dbl> 124, 0, 57, 66, 298, 313,…
$ calendar_last_scraped <date> 2021-09-07, 2021-09-07, …
$ number_of_reviews <dbl> 280, 339, 5, 223, 353, 49…
$ number_of_reviews_ltm <dbl> 2, 0, 0, 4, 19, 17, 0, 0,…
$ number_of_reviews_l30d <dbl> 0, 0, 0, 2, 8, 10, 0, 0, …
$ first_review <date> 2013-08-25, 2014-01-17, …
$ last_review <date> 2019-11-21, 2020-03-27, …
$ review_scores_rating <dbl> 4.89, 4.44, 5.00, 4.95, 4…
$ review_scores_accuracy <dbl> 4.93, 4.69, 5.00, 4.93, 4…
$ review_scores_cleanliness <dbl> 5.00, 4.79, 5.00, 4.96, 4…
$ review_scores_checkin <dbl> 4.97, 4.63, 5.00, 4.95, 4…
$ review_scores_communication <dbl> 4.97, 4.62, 5.00, 4.92, 4…
$ review_scores_location <dbl> 4.68, 4.87, 5.00, 4.90, 4…
$ review_scores_value <dbl> 4.81, 4.49, 4.80, 4.80, 4…
$ license <chr> "0363 5F3A 5684 6750 D14D…
$ instant_bookable <lgl> TRUE, TRUE, FALSE, TRUE, …
$ calculated_host_listings_count <dbl> 1, 2, 1, 1, 2, 2, 1, 1, 2…
$ calculated_host_listings_count_entire_homes <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 2…
$ calculated_host_listings_count_private_rooms <dbl> 1, 2, 0, 1, 2, 2, 0, 0, 0…
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month <dbl> 2.86, 3.64, 0.11, 2.14, 4…
skim(listings)| Name | listings |
| Number of rows | 16116 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 24 |
| Date | 5 |
| logical | 8 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 16116 | 0 |
| name | 30 | 1.00 | 1 | 240 | 0 | 15766 | 0 |
| description | 223 | 0.99 | 1 | 1000 | 0 | 15711 | 0 |
| neighborhood_overview | 5711 | 0.65 | 1 | 1000 | 0 | 9495 | 0 |
| picture_url | 0 | 1.00 | 61 | 126 | 0 | 15968 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 14197 | 0 |
| host_name | 5 | 1.00 | 1 | 33 | 0 | 5166 | 0 |
| host_location | 35 | 1.00 | 2 | 86 | 0 | 560 | 0 |
| host_about | 7066 | 0.56 | 1 | 8920 | 0 | 7522 | 16 |
| host_response_time | 5 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 5 | 1.00 | 2 | 4 | 0 | 57 | 0 |
| host_acceptance_rate | 5 | 1.00 | 2 | 4 | 0 | 91 | 0 |
| host_thumbnail_url | 5 | 1.00 | 55 | 106 | 0 | 14161 | 0 |
| host_picture_url | 5 | 1.00 | 57 | 109 | 0 | 14161 | 0 |
| host_neighbourhood | 5713 | 0.65 | 4 | 35 | 0 | 68 | 0 |
| host_verifications | 0 | 1.00 | 2 | 158 | 0 | 361 | 0 |
| neighbourhood | 5711 | 0.65 | 11 | 59 | 0 | 67 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 4 | 38 | 0 | 22 | 0 |
| property_type | 0 | 1.00 | 3 | 35 | 0 | 67 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 21 | 1.00 | 6 | 17 | 0 | 26 | 0 |
| amenities | 0 | 1.00 | 2 | 1488 | 0 | 14667 | 0 |
| price | 0 | 1.00 | 5 | 9 | 0 | 498 | 0 |
| license | 11561 | 0.28 | 6 | 63 | 0 | 3624 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-07 | 2021-09-08 | 2021-09-07 | 2 |
| host_since | 5 | 1.00 | 2008-09-24 | 2021-08-31 | 2015-04-07 | 3235 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-07 | 2021-09-08 | 2021-09-07 | 2 |
| first_review | 2087 | 0.87 | 2011-04-19 | 2021-09-07 | 2017-10-30 | 2467 |
| last_review | 2087 | 0.87 | 2011-06-12 | 2021-09-07 | 2019-06-10 | 2019 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 5 | 1 | 0.12 | FAL: 14103, TRU: 2008 |
| host_has_profile_pic | 5 | 1 | 1.00 | TRU: 16078, FAL: 33 |
| host_identity_verified | 5 | 1 | 0.67 | TRU: 10720, FAL: 5391 |
| neighbourhood_group_cleansed | 16116 | 0 | NaN | : |
| bathrooms | 16116 | 0 | NaN | : |
| calendar_updated | 16116 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.96 | TRU: 15457, FAL: 659 |
| instant_bookable | 0 | 1 | 0.23 | FAL: 12408, TRU: 3708 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.118184e+07 | 13520629.26 | 2.818000e+03 | 1.018020e+07 | 1.926593e+07 | 3.107548e+07 | 5.208280e+07 | ▇▇▆▅▂ |
| scrape_id | 0 | 1.00 | 2.021091e+13 | 0.00 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 6.976052e+07 | 90271635.64 | 3.159000e+03 | 9.735558e+06 | 2.974134e+07 | 8.988321e+07 | 4.210037e+08 | ▇▂▁▁▁ |
| host_listings_count | 5 | 1.00 | 2.020000e+00 | 23.34 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.992000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 5 | 1.00 | 2.020000e+00 | 23.34 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.992000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 5.237000e+01 | 0.02 | 5.229000e+01 | 5.236000e+01 | 5.236000e+01 | 5.238000e+01 | 5.243000e+01 | ▁▁▇▅▁ |
| longitude | 0 | 1.00 | 4.890000e+00 | 0.04 | 4.760000e+00 | 4.860000e+00 | 4.890000e+00 | 4.910000e+00 | 5.070000e+00 | ▁▆▇▁▁ |
| accommodates | 0 | 1.00 | 2.840000e+00 | 1.31 | 0.000000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▃▁▁▁ |
| bedrooms | 898 | 0.94 | 1.530000e+00 | 0.95 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+01 | ▇▁▁▁▁ |
| beds | 97 | 0.99 | 1.760000e+00 | 1.47 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.300000e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 3.990000e+00 | 20.99 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 6.081500e+02 | 540.67 | 1.000000e+00 | 2.100000e+01 | 1.125000e+03 | 1.125000e+03 | 1.825000e+03 | ▇▁▁▇▁ |
| minimum_minimum_nights | 3 | 1.00 | 3.950000e+00 | 20.99 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 3 | 1.00 | 4.150000e+00 | 21.03 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 3 | 1.00 | 6.855200e+02 | 532.48 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 1.825000e+03 | ▆▁▁▇▁ |
| maximum_maximum_nights | 3 | 1.00 | 2.672465e+05 | 23924509.45 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 3 | 1.00 | 4.030000e+00 | 21.00 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 3 | 1.00 | 2.666400e+05 | 23870381.65 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 2.142625e+09 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 4.030000e+00 | 8.70 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| availability_60 | 0 | 1.00 | 8.840000e+00 | 18.14 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 6.000000e+01 | ▇▁▁▁▁ |
| availability_90 | 0 | 1.00 | 1.432000e+01 | 28.51 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 5.000000e+00 | 9.000000e+01 | ▇▁▁▁▁ |
| availability_365 | 0 | 1.00 | 5.532000e+01 | 107.91 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.700000e+01 | 3.650000e+02 | ▇▁▁▁▁ |
| number_of_reviews | 0 | 1.00 | 2.465000e+01 | 56.71 | 0.000000e+00 | 2.000000e+00 | 8.000000e+00 | 2.200000e+01 | 8.770000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 1.390000e+00 | 7.48 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.220000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 3.300000e-01 | 2.03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.370000e+02 | ▇▁▁▁▁ |
| review_scores_rating | 2087 | 0.87 | 4.690000e+00 | 0.67 | 0.000000e+00 | 4.670000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 2301 | 0.86 | 4.810000e+00 | 0.35 | 0.000000e+00 | 4.750000e+00 | 4.910000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 2300 | 0.86 | 4.700000e+00 | 0.44 | 0.000000e+00 | 4.600000e+00 | 4.830000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 2309 | 0.86 | 4.850000e+00 | 0.31 | 0.000000e+00 | 4.820000e+00 | 4.950000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 2304 | 0.86 | 4.870000e+00 | 0.30 | 0.000000e+00 | 4.850000e+00 | 4.980000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 2309 | 0.86 | 4.730000e+00 | 0.33 | 0.000000e+00 | 4.600000e+00 | 4.810000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 2309 | 0.86 | 4.600000e+00 | 0.39 | 0.000000e+00 | 4.500000e+00 | 4.670000e+00 | 4.830000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 1.640000e+00 | 2.40 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.100000e+00 | 1.87 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 4.900000e-01 | 1.53 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.100000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.000000e-02 | 0.13 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | ▇▁▁▁▁ |
| reviews_per_month | 2087 | 0.87 | 6.800000e-01 | 1.72 | 1.000000e-02 | 1.100000e-01 | 2.700000e-01 | 6.200000e-01 | 1.078400e+02 | ▇▁▁▁▁ |
First out of interest we would like to plot a map of the raw dataset.
By plotting a cluster map that shows the number of flat listed for rent, we want to visualise the concentration of listings in the area of Amsterdam
#Map the concentration of listing in Amsterdam
listings %>%
leaflet() %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillOpacity = 0.3,
popup = ~listing_url,
label = ~property_type,
clusterOptions = markerClusterOptions()
)From this map, we concluded that more properties are available in the central area of Amsterdam. We would like to further investigate the relationship between the price of listings and the locations of properties.
We select some variables for our analysis which we believe would be relevant to determine the impact on prices.
We will hypothesize that the following are key variables which will will be using later on
price: cost per night (variable we will be use to predict)
property_type: type of accommodation (House, Apartment, etc.)
room_type:
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 100)
neighbourhood*: three variables on a few major neighbourhoods in each city
accomodations: How many people can stay in the room per night
beds: Number of beds
host_is_superhost: Whether the host has superhost status
minimum_nights: Minimum nights to stay in the Airbnb
bedrooms: Number of bedrooms
reviews_per_month: Number of reviews per month
availability_30: Available stays next 30 days
license: license number
instant_bookable: Whether the room can be instantly booked
If you are interest in further information is is available here
Let’s first filter for our variables of interest
#Select the variables for further exploration
listings2 <- listings %>%
select(c("host_since",
"host_is_superhost",
"neighbourhood_cleansed",
"property_type",
"room_type",
"accommodates",
"price",
"review_scores_rating",
"number_of_reviews",
"minimum_nights",
"reviews_per_month",
"bedrooms","beds",
"availability_30",
"last_review",
"license",
"instant_bookable"))
# We want to choose the "bathrooms" as well, however,this variable has no valuesNow let’s ensure we can work with our data.
First lets briefly look at our new dataset
skim(listings2)| Name | listings2 |
| Number of rows | 16116 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| Date | 2 |
| logical | 2 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| neighbourhood_cleansed | 0 | 1.00 | 4 | 38 | 0 | 22 | 0 |
| property_type | 0 | 1.00 | 3 | 35 | 0 | 67 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| price | 0 | 1.00 | 5 | 9 | 0 | 498 | 0 |
| license | 11561 | 0.28 | 6 | 63 | 0 | 3624 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 5 | 1.00 | 2008-09-24 | 2021-08-31 | 2015-04-07 | 3235 |
| last_review | 2087 | 0.87 | 2011-06-12 | 2021-09-07 | 2019-06-10 | 2019 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 5 | 1 | 0.12 | FAL: 14103, TRU: 2008 |
| instant_bookable | 0 | 1 | 0.23 | FAL: 12408, TRU: 3708 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| accommodates | 0 | 1.00 | 2.84 | 1.31 | 0.00 | 2.00 | 2.00 | 4.00 | 16.00 | ▇▃▁▁▁ |
| review_scores_rating | 2087 | 0.87 | 4.69 | 0.67 | 0.00 | 4.67 | 4.86 | 5.00 | 5.00 | ▁▁▁▁▇ |
| number_of_reviews | 0 | 1.00 | 24.65 | 56.71 | 0.00 | 2.00 | 8.00 | 22.00 | 877.00 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 3.99 | 20.99 | 1.00 | 2.00 | 2.00 | 3.00 | 1100.00 | ▇▁▁▁▁ |
| reviews_per_month | 2087 | 0.87 | 0.68 | 1.72 | 0.01 | 0.11 | 0.27 | 0.62 | 107.84 | ▇▁▁▁▁ |
| bedrooms | 898 | 0.94 | 1.53 | 0.95 | 1.00 | 1.00 | 1.00 | 2.00 | 50.00 | ▇▁▁▁▁ |
| beds | 97 | 0.99 | 1.76 | 1.47 | 0.00 | 1.00 | 1.00 | 2.00 | 33.00 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 4.03 | 8.70 | 0.00 | 0.00 | 0.00 | 0.00 | 30.00 | ▇▁▁▁▁ |
We see some issues that we will try and resolve.
First:
Price is sometimes given as a character string and we would like it to be a double variable.
Below we change the price to a double, i.e. a numeric variable.
#Change the 'price' to numeric
listings2 <- listings2 %>%
mutate(price = parse_number(price))
#Use `typeof(listing2$price)` to confirm that `price` is now stored as a number.
typeof(listings2$price)[1] "double"
We then want to impute the missing value of “license” with logical judgement instead of a license number.
This will make our analysis simply focus on whether people have a license or not.
#Impute the missing value of “license”
listings2<-listings2 %>%
#Add yes or no depending on whether variable is missing
mutate(license = ifelse(is.na(license), "no","yes"))
#Let's check our data by filtering for our license
License_check <- listings2 %>%
select(license)
#Let's take a glimpse at the results.
glimpse(License_check)Rows: 16,116
Columns: 1
$ license <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "y…
The code works as hoped.
Next, we look at the variable property_type. We use the count function to determine how many categories there are and their frequency.
This will be useful later to summarize our property types in larger categories as there will be more observations and less variables in the regression.
#Determine how many categories there are and their frequency
count_prop <- listings2 %>%
#Select variable of interest - i.e. property types
select(property_type) %>%
#Count Number of each property type
count(property_type) %>%
#Arrange by most frequent category
arrange(desc(n)) %>%
#Create a new column for proportions
mutate(percentage=n/sum(n)*100) #Determine the proportion of the total listings the top 4 types make up
#Print Table
count_prop| property_type | n | percentage |
|---|---|---|
| Entire rental unit | 9776 | 60.7 |
| Private room in rental unit | 1850 | 11.5 |
| Entire residential home | 1045 | 6.48 |
| Entire townhouse | 429 | 2.66 |
| Entire condominium (condo) | 354 | 2.2 |
| Private room in bed and breakfast | 316 | 1.96 |
| Private room in residential home | 296 | 1.84 |
| Entire loft | 269 | 1.67 |
| Houseboat | 186 | 1.15 |
| Boat | 181 | 1.12 |
| Room in boutique hotel | 174 | 1.08 |
| Private room in townhouse | 156 | 0.968 |
| Entire serviced apartment | 127 | 0.788 |
| Private room in houseboat | 112 | 0.695 |
| Private room in guest suite | 110 | 0.683 |
| Room in hotel | 107 | 0.664 |
| Private room in condominium (condo) | 97 | 0.602 |
| Private room in boat | 94 | 0.583 |
| Private room in loft | 53 | 0.329 |
| Room in bed and breakfast | 36 | 0.223 |
| Entire guest suite | 34 | 0.211 |
| Entire villa | 30 | 0.186 |
| Shared room in rental unit | 24 | 0.149 |
| Private room in hostel | 23 | 0.143 |
| Room in aparthotel | 21 | 0.13 |
| Entire guesthouse | 20 | 0.124 |
| Room in hostel | 20 | 0.124 |
| Room in serviced apartment | 16 | 0.0993 |
| Private room | 15 | 0.0931 |
| Private room in guesthouse | 14 | 0.0869 |
| Entire place | 11 | 0.0683 |
| Private room in serviced apartment | 10 | 0.0621 |
| Entire cottage | 9 | 0.0558 |
| Shared room in hostel | 8 | 0.0496 |
| Tiny house | 8 | 0.0496 |
| Private room in farm stay | 7 | 0.0434 |
| Private room in tiny house | 6 | 0.0372 |
| Private room in villa | 6 | 0.0372 |
| Entire cabin | 5 | 0.031 |
| Barn | 4 | 0.0248 |
| Entire bed and breakfast | 4 | 0.0248 |
| Entire bungalow | 4 | 0.0248 |
| Private room in casa particular | 4 | 0.0248 |
| Camper/RV | 3 | 0.0186 |
| Entire home/apt | 3 | 0.0186 |
| Floor | 3 | 0.0186 |
| Private room in cabin | 3 | 0.0186 |
| Shared room in houseboat | 3 | 0.0186 |
| Shared room in residential home | 3 | 0.0186 |
| Campsite | 2 | 0.0124 |
| Entire chalet | 2 | 0.0124 |
| Private room in dome house | 2 | 0.0124 |
| Private room in earth house | 2 | 0.0124 |
| Private room in floor | 2 | 0.0124 |
| Private room in island | 2 | 0.0124 |
| Shared room in bed and breakfast | 2 | 0.0124 |
| Tower | 2 | 0.0124 |
| Yurt | 2 | 0.0124 |
| Bus | 1 | 0.00621 |
| Cave | 1 | 0.00621 |
| Earth house | 1 | 0.00621 |
| Private room in bungalow | 1 | 0.00621 |
| Private room in nature lodge | 1 | 0.00621 |
| Room in casa particular | 1 | 0.00621 |
| Shared room in boat | 1 | 0.00621 |
| Shared room in loft | 1 | 0.00621 |
| Tipi | 1 | 0.00621 |
The top 4 common property types are “Entire rental unit”,“Private room in rental unit”,“Entire residential home” and “Entire townhouse”. The top 4 types make up 60.66%, 11.48%, 6.48%, and 2.66% of the total listings
This will also make it easier for our regressions at a later stage given this is categorical variables.
#Simplify the "property_type"
listings2 <- listings2 %>%
#Create new names called other for properties not in the top four categories
mutate(prop_type_simplified = case_when(
#Specifying names to be kept
property_type %in% c("Entire rental unit",
"Private room in rental unit",
"Entire residential home",
"Entire townhouse") ~ property_type,
#Specifying ot leave remaining in other
TRUE ~ "Other"))
#We check the `prop_type_simplified` was correctly made
listings2 %>%
#Count each category
count(property_type, prop_type_simplified) %>%
#Arrange by largest category
arrange(desc(n)) | property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 9776 |
| Private room in rental unit | Private room in rental unit | 1850 |
| Entire residential home | Entire residential home | 1045 |
| Entire townhouse | Entire townhouse | 429 |
| Entire condominium (condo) | Other | 354 |
| Private room in bed and breakfast | Other | 316 |
| Private room in residential home | Other | 296 |
| Entire loft | Other | 269 |
| Houseboat | Other | 186 |
| Boat | Other | 181 |
| Room in boutique hotel | Other | 174 |
| Private room in townhouse | Other | 156 |
| Entire serviced apartment | Other | 127 |
| Private room in houseboat | Other | 112 |
| Private room in guest suite | Other | 110 |
| Room in hotel | Other | 107 |
| Private room in condominium (condo) | Other | 97 |
| Private room in boat | Other | 94 |
| Private room in loft | Other | 53 |
| Room in bed and breakfast | Other | 36 |
| Entire guest suite | Other | 34 |
| Entire villa | Other | 30 |
| Shared room in rental unit | Other | 24 |
| Private room in hostel | Other | 23 |
| Room in aparthotel | Other | 21 |
| Entire guesthouse | Other | 20 |
| Room in hostel | Other | 20 |
| Room in serviced apartment | Other | 16 |
| Private room | Other | 15 |
| Private room in guesthouse | Other | 14 |
| Entire place | Other | 11 |
| Private room in serviced apartment | Other | 10 |
| Entire cottage | Other | 9 |
| Shared room in hostel | Other | 8 |
| Tiny house | Other | 8 |
| Private room in farm stay | Other | 7 |
| Private room in tiny house | Other | 6 |
| Private room in villa | Other | 6 |
| Entire cabin | Other | 5 |
| Barn | Other | 4 |
| Entire bed and breakfast | Other | 4 |
| Entire bungalow | Other | 4 |
| Private room in casa particular | Other | 4 |
| Camper/RV | Other | 3 |
| Entire home/apt | Other | 3 |
| Floor | Other | 3 |
| Private room in cabin | Other | 3 |
| Shared room in houseboat | Other | 3 |
| Shared room in residential home | Other | 3 |
| Campsite | Other | 2 |
| Entire chalet | Other | 2 |
| Private room in dome house | Other | 2 |
| Private room in earth house | Other | 2 |
| Private room in floor | Other | 2 |
| Private room in island | Other | 2 |
| Shared room in bed and breakfast | Other | 2 |
| Tower | Other | 2 |
| Yurt | Other | 2 |
| Bus | Other | 1 |
| Cave | Other | 1 |
| Earth house | Other | 1 |
| Private room in bungalow | Other | 1 |
| Private room in nature lodge | Other | 1 |
| Room in casa particular | Other | 1 |
| Shared room in boat | Other | 1 |
| Shared room in loft | Other | 1 |
| Tipi | Other | 1 |
#Removing the old property type
listings2 <- subset(listings2, select = -property_type)Our table comes out as expected and we can see that all smaller property types are in the other category.
#Using above dataset for the confidence interval calculations
formula_ci <- listings2%>%
group_by(prop_type_simplified) %>%
#Calculate weight's summary statistics for people exercising at least 3 times a week
# calculate mean, SD, count, SE, lower/upper 95% CI
summarise(
average_price=mean(price,na.rm=TRUE), #Mean, we choose to ignore any missing values by setting the 'na.rm = TRUE'
sd_price=sd(price,na.rm=TRUE), #Standard Deviation
count= n(), #Observations
t_critical = qt(0.975,count-1), #T-Critical at 95% Confidence Interval and these observations
se_price=sd_price/sqrt(count), #Standard Error
margin_of_error= t_critical*se_price, #Margin of Error
price_low= average_price - margin_of_error, #Lower interval
price_high= average_price + margin_of_error) #Upper Interval
formula_ci| prop_type_simplified | average_price | sd_price | count | t_critical | se_price | margin_of_error | price_low | price_high |
|---|---|---|---|---|---|---|---|---|
| Entire rental unit | 160 | 166 | 9776 | 1.96 | 1.68 | 3.29 | 157 | 163 |
| Entire residential home | 208 | 169 | 1045 | 1.96 | 5.23 | 10.3 | 198 | 219 |
| Entire townhouse | 218 | 134 | 429 | 1.97 | 6.47 | 12.7 | 206 | 231 |
| Other | 160 | 208 | 3016 | 1.96 | 3.79 | 7.44 | 152 | 167 |
| Private room in rental unit | 93.1 | 136 | 1850 | 1.96 | 3.17 | 6.22 | 86.8 | 99.3 |
ggplot(formula_ci, aes(x=average_price, y=prop_type_simplified, color=prop_type_simplified)) +
#geom_errorbar function allows us to show the two bars with confidence intervals
geom_errorbar(aes(xmin=price_low, xmax=price_high),width = 0.1, size=0.5)+
geom_point(aes(x=average_price),size=1)+
theme_bw() +
theme(legend.position = "none",axis.title.y=element_blank())+
labs(title = "Confidence interval for average price per property type",
subtitle = "95% confidence intervals overlap",
x = "Average Price"
) +
NULL The property type seems to have a significant effect on prices. The length of the intervals mainly vary due to the differences in sample size. The most common property type (Entire rental unit) has a small interval because it has the larger sample. We can estimate quite aaccuratly the average price in this category.
Next, we look at the variable neighbourhood_cleansed. We use the count function to determine how many categories there are and their frequency. The top 5 common neighbourhoods are “De Baarsjes - Oud-West”,“De Pijp - Rivierenbuurt”,“Centrum-West”, “Centrum-Oost” and “WesterPark”. The top 5 types make up 16.8%, 12.3%, 10.8%, 8.51%%, and 7.48% of the total listings
#Determine how many categories there are and their frequency
count_prop <- listings2 %>%
#Selecting variable of interest
select(neighbourhood_cleansed) %>%
#Counting observations
count(neighbourhood_cleansed) %>%
#Arrange by observations
arrange(desc(n)) %>%
#In Percentage
mutate(percentage=n/sum(n)*100) #Determine the proportion of the total listings the top 4 types make up
#Print Table
count_prop| neighbourhood_cleansed | n | percentage |
|---|---|---|
| De Baarsjes - Oud-West | 2701 | 16.8 |
| De Pijp - Rivierenbuurt | 1986 | 12.3 |
| Centrum-West | 1746 | 10.8 |
| Centrum-Oost | 1372 | 8.51 |
| Westerpark | 1205 | 7.48 |
| Zuid | 1196 | 7.42 |
| Oud-Oost | 1032 | 6.4 |
| Bos en Lommer | 931 | 5.78 |
| Oostelijk Havengebied - Indische Buurt | 733 | 4.55 |
| Oud-Noord | 518 | 3.21 |
| Watergraafsmeer | 476 | 2.95 |
| IJburg - Zeeburgereiland | 396 | 2.46 |
| Slotervaart | 349 | 2.17 |
| Noord-West | 323 | 2 |
| Noord-Oost | 227 | 1.41 |
| Buitenveldert - Zuidas | 217 | 1.35 |
| Geuzenveld - Slotermeer | 200 | 1.24 |
| De Aker - Nieuw Sloten | 116 | 0.72 |
| Osdorp | 114 | 0.707 |
| Gaasperdam - Driemond | 107 | 0.664 |
| Bijlmer-Centrum | 91 | 0.565 |
| Bijlmer-Oost | 80 | 0.496 |
Looking into this we see that there are many more high observation neighbourhoods compared to property types.
Therefore, we believe it makes more sense to split the neighbourhood by their relative prices.
We do this below
#Setting neighbourhood by price
neighbourhood_group <- listings2 %>%
#Variable of interest
group_by(neighbourhood_cleansed) %>%
#Getting median price
summarize(price_median = median(price)) %>%
#Arranging by highest price
arrange(desc(price_median))
#Print
neighbourhood_group| neighbourhood_cleansed | price_median |
|---|---|
| Centrum-Oost | 150 |
| Centrum-West | 150 |
| IJburg - Zeeburgereiland | 146 |
| De Pijp - Rivierenbuurt | 140 |
| Zuid | 134 |
| De Baarsjes - Oud-West | 129 |
| Oud-Oost | 129 |
| Westerpark | 129 |
| Oud-Noord | 125 |
| Watergraafsmeer | 125 |
| Noord-West | 120 |
| Noord-Oost | 119 |
| Oostelijk Havengebied - Indische Buurt | 114 |
| Buitenveldert - Zuidas | 112 |
| Bos en Lommer | 110 |
| Geuzenveld - Slotermeer | 100 |
| Slotervaart | 100 |
| De Aker - Nieuw Sloten | 98 |
| Osdorp | 91 |
| Gaasperdam - Driemond | 85 |
| Bijlmer-Centrum | 83 |
| Bijlmer-Oost | 79 |
We see that some central neighbourhoods are more expensive, which logically makes sense given this is a large city and that is where tourists and other activities are.
Let’s categorize neighbourhood by price
#Simplify the "property_type"
listings2 <- listings2 %>%
#Adding new variable names
mutate(neighbourhood_cleansed = case_when(
neighbourhood_cleansed == "Centrum-Oost" ~ "Top 5",
neighbourhood_cleansed == "Centrum-West" ~ "Top 5",
neighbourhood_cleansed == "IJburg - Zeeburgereiland" ~ "Top 5",
neighbourhood_cleansed == "Zuid" ~ "Top 5",
neighbourhood_cleansed == "De Pijp - Rivierenbuurt" ~ "Top 5",
neighbourhood_cleansed == "Oud-Noord" ~ "Top 6-10",
neighbourhood_cleansed == "Watergraafsmeer" ~ "Top 6-10",
neighbourhood_cleansed == "Oud-Oost" ~ "Top 6-10",
neighbourhood_cleansed == "Westerpark" ~ "Top 6-10",
neighbourhood_cleansed == "De Baarsjes - Oud-West" ~ "Top 6-10",
neighbourhood_cleansed == "Noord-West" ~ "Top 11-15",
neighbourhood_cleansed == "Noord-Oost" ~ "Top 11-15",
neighbourhood_cleansed == "Oostelijk Havengebied - Indische Buurt" ~ "Top 11-15",
neighbourhood_cleansed == "Buitenveldert - Zuidas" ~ "Top 11-15",
neighbourhood_cleansed == "Bos en Lommer" ~ "Top 11-15",
neighbourhood_cleansed == "Geuzenveld - Slotermeer" ~ "Remaining",
neighbourhood_cleansed == "Slotervaart" ~ "Remaining",
neighbourhood_cleansed == "De Aker - Nieuw Sloten" ~ "Remaining",
neighbourhood_cleansed == "Osdorp" ~ "Remaining",
neighbourhood_cleansed == "Gaasperdam - Driemond" ~ "Remaining",
neighbourhood_cleansed == "Bijlmer-Centrum" ~ "Remaining",
neighbourhood_cleansed == "Bijlmer-Oost" ~ "Remaining"))
#Check the `prop_type_simplified` was correctly made
listings2 %>%
#Counting by number of observations
count(neighbourhood_cleansed) %>%
#Displaying in descending order
arrange(desc(n)) | neighbourhood_cleansed | n |
|---|---|
| Top 5 | 6696 |
| Top 6-10 | 5932 |
| Top 11-15 | 2431 |
| Remaining | 1057 |
Let’s first look into how minimum nights look at Airbnb Amsterdam.
#Find out the most common values for the variable `minimum_nights`
listings2 %>%
#Count Observations
count(minimum_nights) %>%
#Arrange by observations
arrange(desc(n))| minimum_nights | n |
|---|---|
| 2 | 5993 |
| 3 | 3918 |
| 1 | 2658 |
| 4 | 1367 |
| 5 | 895 |
| 7 | 481 |
| 6 | 234 |
| 14 | 109 |
| 10 | 92 |
| 30 | 63 |
| 8 | 33 |
| 21 | 32 |
| 20 | 30 |
| 25 | 19 |
| 12 | 18 |
| 15 | 18 |
| 28 | 16 |
| 60 | 15 |
| 9 | 12 |
| 90 | 11 |
| 13 | 7 |
| 59 | 7 |
| 180 | 7 |
| 11 | 4 |
| 19 | 4 |
| 23 | 4 |
| 29 | 4 |
| 50 | 4 |
| 58 | 4 |
| 100 | 4 |
| 16 | 3 |
| 27 | 3 |
| 31 | 3 |
| 200 | 3 |
| 300 | 3 |
| 365 | 3 |
| 18 | 2 |
| 26 | 2 |
| 42 | 2 |
| 45 | 2 |
| 99 | 2 |
| 150 | 2 |
| 1e+03 | 2 |
| 24 | 1 |
| 33 | 1 |
| 37 | 1 |
| 38 | 1 |
| 63 | 1 |
| 70 | 1 |
| 75 | 1 |
| 80 | 1 |
| 89 | 1 |
| 94 | 1 |
| 95 | 1 |
| 120 | 1 |
| 181 | 1 |
| 183 | 1 |
| 186 | 1 |
| 222 | 1 |
| 240 | 1 |
| 500 | 1 |
| 999 | 1 |
| 1e+03 | 1 |
| 1.1e+03 | 1 |
By looking into the vairable minimum_nights, we found out that airbnb most commonly required people to stay for at least 2 nights. The majority of users tend to spend less than a week, which indicates that Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels..
There are also some variables that stand out i.e. minimum nights of 1,100 days (3 years minimum rent). These individuals are probably looking for long term renters.
We only want to include listings in our regression analysis that are intended for travel purposes, i.e. no more than 4 days.
#Filter the Airbnb data so that it only includes observations with `minimum_nights <= 4`
listings2 <- listings2 %>%
filter(minimum_nights <= 4)We then skim the dataset to see if further data wrangling is required.
skim(listings2)| Name | listings2 |
| Number of rows | 13936 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 2 |
| logical | 2 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| neighbourhood_cleansed | 0 | 1 | 5 | 9 | 0 | 4 | 0 |
| room_type | 0 | 1 | 10 | 15 | 0 | 4 | 0 |
| license | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| prop_type_simplified | 0 | 1 | 5 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 5 | 1.00 | 2008-09-24 | 2021-08-31 | 2015-05-04 | 3121 |
| last_review | 1602 | 0.89 | 2012-05-12 | 2021-09-07 | 2019-06-16 | 1920 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 5 | 1 | 0.13 | FAL: 12118, TRU: 1813 |
| instant_bookable | 0 | 1 | 0.24 | FAL: 10656, TRU: 3280 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| accommodates | 0 | 1.00 | 2.82 | 1.31 | 0.00 | 2.00 | 2.00 | 4.00 | 16.00 | ▇▃▁▁▁ |
| price | 0 | 1.00 | 155.55 | 164.51 | 0.00 | 96.00 | 130.00 | 180.00 | 8000.00 | ▇▁▁▁▁ |
| review_scores_rating | 1602 | 0.89 | 4.69 | 0.65 | 0.00 | 4.67 | 4.85 | 5.00 | 5.00 | ▁▁▁▁▇ |
| number_of_reviews | 0 | 1.00 | 26.39 | 59.57 | 0.00 | 2.00 | 8.00 | 23.00 | 877.00 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 2.29 | 0.88 | 1.00 | 2.00 | 2.00 | 3.00 | 4.00 | ▃▇▁▅▂ |
| reviews_per_month | 1602 | 0.89 | 0.72 | 1.78 | 0.01 | 0.11 | 0.28 | 0.65 | 107.84 | ▇▁▁▁▁ |
| bedrooms | 814 | 0.94 | 1.50 | 0.85 | 1.00 | 1.00 | 1.00 | 2.00 | 15.00 | ▇▁▁▁▁ |
| beds | 85 | 0.99 | 1.74 | 1.49 | 0.00 | 1.00 | 1.00 | 2.00 | 33.00 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 4.07 | 8.71 | 0.00 | 0.00 | 0.00 | 0.00 | 30.00 | ▇▁▁▁▁ |
As we can see, review_scores_rating, reviews_per_month, host_is_superhost, bedrooms and beds still have missing values. We will drop unreviewed units for now and as these units will likely be less accurate in terms of price / review ratings. Similarly for bedrooms and beds, we drop missing values.
#Dropping values
listings2 <- listings2 %>%
#Choosing which variables to drop NAs from
drop_na(bedrooms,beds,host_is_superhost,review_scores_rating,reviews_per_month)
#Skimming to confirm code works as intended.
skim(listings2)| Name | listings2 |
| Number of rows | 11586 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 2 |
| logical | 2 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| neighbourhood_cleansed | 0 | 1 | 5 | 9 | 0 | 4 | 0 |
| room_type | 0 | 1 | 10 | 15 | 0 | 4 | 0 |
| license | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| prop_type_simplified | 0 | 1 | 5 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 0 | 1 | 2008-09-24 | 2021-08-11 | 2015-03-04 | 2921 |
| last_review | 0 | 1 | 2012-05-12 | 2021-09-07 | 2019-06-14 | 1891 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 0 | 1 | 0.14 | FAL: 9945, TRU: 1641 |
| instant_bookable | 0 | 1 | 0.22 | FAL: 9018, TRU: 2568 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| accommodates | 0 | 1 | 2.85 | 1.30 | 1.00 | 2.00 | 2.00 | 4.00 | 16.00 | ▇▁▁▁▁ |
| price | 0 | 1 | 154.60 | 143.04 | 4.00 | 99.00 | 130.00 | 180.00 | 8000.00 | ▇▁▁▁▁ |
| review_scores_rating | 0 | 1 | 4.70 | 0.65 | 0.00 | 4.67 | 4.86 | 5.00 | 5.00 | ▁▁▁▁▇ |
| number_of_reviews | 0 | 1 | 29.16 | 61.35 | 1.00 | 4.00 | 10.00 | 26.00 | 866.00 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1 | 2.32 | 0.87 | 1.00 | 2.00 | 2.00 | 3.00 | 4.00 | ▃▇▁▅▂ |
| reviews_per_month | 0 | 1 | 0.69 | 1.79 | 0.01 | 0.11 | 0.28 | 0.62 | 107.84 | ▇▁▁▁▁ |
| bedrooms | 0 | 1 | 1.50 | 0.85 | 1.00 | 1.00 | 1.00 | 2.00 | 15.00 | ▇▁▁▁▁ |
| beds | 0 | 1 | 1.76 | 1.51 | 0.00 | 1.00 | 1.00 | 2.00 | 33.00 | ▇▁▁▁▁ |
| availability_30 | 0 | 1 | 3.89 | 8.47 | 0.00 | 0.00 | 0.00 | 0.00 | 30.00 | ▇▁▁▁▁ |
We also know that ratings on Airbnb are 1-5, therefore there must be an issue with reviews that are 0 out of 5.
Let’s remove these from our data
#Dropping reviews at 0
listings2 <- listings2 %>%
filter(review_scores_rating > 0)Let’s take a final skim of the data
skim(listings2)| Name | listings2 |
| Number of rows | 11426 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 2 |
| logical | 2 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| neighbourhood_cleansed | 0 | 1 | 5 | 9 | 0 | 4 | 0 |
| room_type | 0 | 1 | 10 | 15 | 0 | 4 | 0 |
| license | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| prop_type_simplified | 0 | 1 | 5 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 0 | 1 | 2008-09-24 | 2021-08-11 | 2015-03-01 | 2911 |
| last_review | 0 | 1 | 2012-05-12 | 2021-09-07 | 2019-06-16 | 1877 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 0 | 1 | 0.14 | FAL: 9785, TRU: 1641 |
| instant_bookable | 0 | 1 | 0.22 | FAL: 8889, TRU: 2537 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| accommodates | 0 | 1 | 2.85 | 1.30 | 1.00 | 2.00 | 2.00 | 4.00 | 16.00 | ▇▁▁▁▁ |
| price | 0 | 1 | 153.68 | 122.98 | 4.00 | 99.00 | 130.00 | 180.00 | 7999.00 | ▇▁▁▁▁ |
| review_scores_rating | 0 | 1 | 4.77 | 0.33 | 1.00 | 4.67 | 4.86 | 5.00 | 5.00 | ▁▁▁▁▇ |
| number_of_reviews | 0 | 1 | 29.55 | 61.69 | 1.00 | 4.00 | 11.00 | 26.00 | 866.00 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1 | 2.32 | 0.87 | 1.00 | 2.00 | 2.00 | 3.00 | 4.00 | ▃▇▁▅▂ |
| reviews_per_month | 0 | 1 | 0.70 | 1.80 | 0.01 | 0.12 | 0.28 | 0.63 | 107.84 | ▇▁▁▁▁ |
| bedrooms | 0 | 1 | 1.50 | 0.85 | 1.00 | 1.00 | 1.00 | 2.00 | 15.00 | ▇▁▁▁▁ |
| beds | 0 | 1 | 1.76 | 1.51 | 0.00 | 1.00 | 1.00 | 2.00 | 33.00 | ▇▁▁▁▁ |
| availability_30 | 0 | 1 | 3.89 | 8.47 | 0.00 | 0.00 | 0.00 | 0.00 | 30.00 | ▇▁▁▁▁ |
We now have wrangled our data sufficiently and have: - 4 character variables ready for analysis with not too many categories. - 2 Date variables with no missing values - 2 Logical variables with no missing variables - 9 range of numeric variables with no missing values.
We also have 11.4 thousand observations.
We will now explore how these variables affect price.
By visualising the relationship between price and the variables that we believe will affect the price substantially, we hope to identify a noticeable correlation between price and each of those variables and use statistical modelling to further investigate their relationships and provide possible explanation.
We first take a look at the price.
#Overview of price
favstats(listings2$price)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 4 | 99 | 130 | 180 | 8e+03 | 154 | 123 | 11426 | 0 |
We see that price ranges from 4 to 8,000 per night with a mean of 154 and median of 130 (i.e. some left skew)
We create the density plot of price.
#Create a density plot of prices
ggplot(listings2,aes(x=price))+
#Using a density plot
geom_density()+
#Adding useful titles
labs(title="Density plot of listing Airbnb prices",
x="Price",
y="Density") +
#Minimal theme
theme_minimal()#Create a density plot of log of prices
ggplot(listings2,aes(x=log(price)))+
#Density plot
geom_density()+
#Useful Labes
labs(title="Density plot of listing Airbnb (log) prices",
x="Log Price",
y="Density") +
#Minimal Theme
theme_minimal()From here forward we will use Log which is more normally distributed. We can see that there are some peaks and lows on the graph which we connect to well-known disposition of people to round numbers, i.e. setting price to be USD 140 instead of USD 137 or USD 142. We assume that the graph would be smooth and almost perfectly normally distributed without this feature
First, we skimmed the neighbourhood_cleansed data to analyse the distribution of price of listings depends on neighbourhood. From the table below, we found out that median price will better suits the purpose of our research than the mean price because the distribution of price of each neighbourhood is greaatly skewed with extreme outliers. For example all neighbourhoods have lower median than mean price while the most expensive property costs USD8,000, causing a large standard deviation. Hence, standard deviation is unusually large among neighbourhood.
#Analyse the distribution of price of listings depends on neighbourhood
favstats(price~neighbourhood_cleansed,data=listings2) %>%
#Arrange by count
arrange(desc(n)) | neighbourhood_cleansed | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Top 5 | 4 | 104 | 149 | 200 | 2e+03 | 172 | 113 | 4647 | 0 |
| Top 6-10 | 10 | 99 | 130 | 175 | 8e+03 | 151 | 149 | 4349 | 0 |
| Top 11-15 | 25 | 89 | 118 | 150 | 800 | 128 | 66.5 | 1705 | 0 |
| Remaining | 22 | 65 | 95 | 136 | 999 | 113 | 84.2 | 725 | 0 |
We then plot a density graph to see the price distribution of the most popular neighbourhood.
#Create the dataset of the top neighbourhood and the price
listings_neigh<-listings2 %>%
#Group by variable we are looking at
group_by(neighbourhood_cleansed) %>%
#ANd get the price
summarize(price=price)
listings_neigh$new <- factor(listings_neigh$neighbourhood_cleansed, levels = c("Top 5", "Top 6-10", "Top 11-15", "Remaining"))
#Plot the density graph
ggplot(listings_neigh,aes(x=log(price), y = desc(new), fill = new))+
#Density Ridges Plit
geom_density_ridges(alpha=0.3) +
#Useful lables
labs(
title = "Difference in price by region expensiveness",
subtitle="Density plot for prices in different region groups",
x = "Price per night (log)",
y = "Density") +
theme_classic() +
#removing y-axis
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
#Title of Legend
guides(fill=guide_legend(title="Neighbourhood regions"))+
#Plotting a vertical line at the median
geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
NULLThe graph shows that the neighbourhoods vary in price and will likely be a good predictor for our analysis later on.
Let’s look at the median price by neighbourhood as well:
#Calculate the median price of different neighborhood
median_per_neighborhood <- listings2 %>%
#Grouping by variable
group_by(neighbourhood_cleansed) %>%
#We want median price
summarise(median_price = median(price))
#Plot the graph
ggplot(median_per_neighborhood,
#Price versus neighbourhood
aes(x = reorder(neighbourhood_cleansed, median_price),
y= median_price)) +
#Columns chart
geom_col(fill = "skyblue") +
#USeful labes
labs(
title = "Median price per night per Neighborhood",
x="Neighborhood",
y="Median Price"
) +
#To flip axis
coord_flip() +
#Simple theme
theme_bw() +
NULLPlotting on a median price basis indicates that these prices are still indeed impacted by how close they are to center.
We skimmed our data to summarise the price distribution based on number of bedrooms.
From the table below, we noticed that listings with more than five bedrooms have less than 10 samples, therefore, we decided to only include properties with up to five bedrooms. Furthermore, we found out that median price will better suit the purpose of our research than the mean price because the distribution of price based on number of bedrooms is greatly skewed due to extreme outliers. Therefore, we believe that the median price gives better understanding of the relationship between the price and the number of bedrooms
#Analyse the distribution of price of listings depends on number of bedrooms
favstats(price~bedrooms, data=listings2)| bedrooms | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 86 | 112 | 149 | 8e+03 | 124 | 113 | 7420 | 0 |
| 2 | 42 | 130 | 172 | 225 | 2e+03 | 189 | 94.9 | 2828 | 0 |
| 3 | 29 | 150 | 199 | 270 | 1.92e+03 | 232 | 143 | 853 | 0 |
| 4 | 25 | 194 | 250 | 350 | 1.16e+03 | 293 | 165 | 260 | 0 |
| 5 | 125 | 247 | 330 | 478 | 814 | 383 | 191 | 40 | 0 |
| 6 | 50 | 409 | 443 | 493 | 550 | 389 | 197 | 5 | 0 |
| 7 | 487 | 487 | 487 | 487 | 487 | 487 | 1 | 0 | |
| 8 | 50 | 162 | 530 | 561 | 811 | 422 | 308 | 6 | 0 |
| 9 | 557 | 557 | 557 | 557 | 557 | 557 | 1 | 0 | |
| 10 | 443 | 696 | 725 | 797 | 857 | 704 | 159 | 5 | 0 |
| 12 | 83 | 83 | 83 | 83 | 857 | 212 | 316 | 6 | 0 |
| 15 | 120 | 120 | 120 | 120 | 120 | 120 | 1 | 0 |
We would like to plot this information
#Filter the bedrooms under 5 to remove outliers
listings2 <- listings2 %>%
filter(bedrooms<=5)
#Median calculations
median_per_bedroom <- listings2 %>%
#By Bedroom
group_by(bedrooms) %>%
#Summarized by median price
summarise(median_price = median(price))
#Plotting the graph
ggplot(median_per_bedroom,
#Bedroom by price
aes(x = reorder(bedrooms,median_price),
#Price on Y Axis
y= median_price)) +
#in Column format
geom_col(fill = "orchid3") +
#With useful labelss
labs(
title = "Median price per night per bedrooms",
x="Number of Bedrooms",
y="Median Price"
) +
#To flip axis
coord_flip() +
#Simple theme
theme_bw() +
NULLWe see a strong relationship with number of bedrooms and price.
We assume number of bedrooms will have high correlation with accommodation and number of beds and will therefore leave those two variables out for now.
We skimmed our data to summarise the price distribution based on the rating score.
#Analyse the distribution of price of listings depends on the rating score
favstats(price~review_scores_rating,data=listings2)| review_scores_rating | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 47 | 61.2 | 99.5 | 163 | 400 | 145 | 120 | 14 | 0 |
| 2 | 60 | 79.8 | 92.5 | 116 | 225 | 112 | 54.6 | 12 | 0 |
| 2.33 | 121 | 121 | 121 | 121 | 121 | 121 | 1 | 0 | |
| 2.5 | 65 | 85 | 105 | 122 | 140 | 103 | 37.5 | 3 | 0 |
| 2.75 | 130 | 130 | 130 | 130 | 130 | 130 | 1 | 0 | |
| 2.8 | 149 | 149 | 149 | 149 | 149 | 149 | 1 | 0 | |
| 3 | 40 | 100 | 130 | 166 | 579 | 146 | 89.4 | 63 | 0 |
| 3.25 | 65 | 67.5 | 70 | 75 | 80 | 71.7 | 7.64 | 3 | 0 |
| 3.29 | 120 | 120 | 120 | 120 | 120 | 120 | 1 | 0 | |
| 3.33 | 50 | 120 | 130 | 150 | 200 | 130 | 54.3 | 5 | 0 |
| 3.4 | 133 | 133 | 133 | 133 | 133 | 133 | 1 | 0 | |
| 3.5 | 30 | 100 | 117 | 145 | 345 | 138 | 80.8 | 23 | 0 |
| 3.56 | 91 | 112 | 133 | 154 | 175 | 133 | 59.4 | 2 | 0 |
| 3.57 | 84 | 84 | 84 | 84 | 84 | 84 | 1 | 0 | |
| 3.6 | 70 | 85 | 100 | 138 | 176 | 115 | 54.6 | 3 | 0 |
| 3.63 | 125 | 134 | 144 | 154 | 163 | 144 | 26.9 | 2 | 0 |
| 3.67 | 29 | 83.8 | 115 | 165 | 286 | 133 | 69.6 | 16 | 0 |
| 3.69 | 129 | 129 | 129 | 129 | 129 | 129 | 1 | 0 | |
| 3.71 | 124 | 148 | 172 | 196 | 220 | 172 | 67.9 | 2 | 0 |
| 3.75 | 95 | 135 | 150 | 219 | 373 | 180 | 87.9 | 9 | 0 |
| 3.78 | 90 | 90 | 90 | 90 | 90 | 90 | 1 | 0 | |
| 3.8 | 96 | 126 | 152 | 187 | 262 | 163 | 59.2 | 6 | 0 |
| 3.83 | 83 | 103 | 114 | 122 | 130 | 110 | 20.1 | 4 | 0 |
| 3.86 | 65 | 73.8 | 82.5 | 91.2 | 100 | 82.5 | 24.7 | 2 | 0 |
| 3.88 | 60 | 102 | 144 | 322 | 500 | 235 | 234 | 3 | 0 |
| 3.89 | 80 | 122 | 165 | 208 | 250 | 165 | 120 | 2 | 0 |
| 4 | 25 | 90 | 120 | 150 | 999 | 137 | 88.2 | 291 | 0 |
| 4.03 | 60 | 66.2 | 72.5 | 78.8 | 85 | 72.5 | 17.7 | 2 | 0 |
| 4.04 | 138 | 138 | 138 | 138 | 138 | 138 | 1 | 0 | |
| 4.05 | 48 | 59 | 70 | 120 | 171 | 96.3 | 65.6 | 3 | 0 |
| 4.07 | 77 | 100 | 145 | 150 | 155 | 125 | 34.8 | 5 | 0 |
| 4.08 | 96 | 100 | 101 | 125 | 760 | 236 | 293 | 5 | 0 |
| 4.09 | 58 | 82 | 110 | 132 | 140 | 104 | 37.8 | 4 | 0 |
| 4.1 | 49 | 174 | 298 | 349 | 400 | 249 | 181 | 3 | 0 |
| 4.11 | 60 | 87 | 100 | 160 | 230 | 127 | 56.6 | 9 | 0 |
| 4.13 | 78 | 80 | 121 | 199 | 461 | 164 | 111 | 13 | 0 |
| 4.14 | 65 | 85 | 130 | 140 | 760 | 164 | 182 | 13 | 0 |
| 4.15 | 54 | 83.5 | 95.5 | 120 | 180 | 105 | 43.5 | 6 | 0 |
| 4.16 | 140 | 140 | 140 | 140 | 140 | 140 | 1 | 0 | |
| 4.17 | 85 | 100 | 140 | 160 | 599 | 162 | 119 | 17 | 0 |
| 4.18 | 37 | 108 | 132 | 167 | 200 | 132 | 52.9 | 8 | 0 |
| 4.19 | 75 | 112 | 150 | 225 | 300 | 175 | 115 | 3 | 0 |
| 4.2 | 49 | 79.2 | 104 | 137 | 400 | 119 | 70.5 | 30 | 0 |
| 4.21 | 50 | 100 | 150 | 165 | 180 | 127 | 68.1 | 3 | 0 |
| 4.22 | 63 | 111 | 146 | 198 | 400 | 161 | 87.8 | 14 | 0 |
| 4.23 | 49 | 85.5 | 100 | 131 | 453 | 135 | 111 | 11 | 0 |
| 4.24 | 34 | 63.8 | 84.5 | 141 | 214 | 102 | 60.3 | 8 | 0 |
| 4.25 | 35 | 81 | 103 | 144 | 599 | 131 | 98 | 58 | 0 |
| 4.26 | 61 | 79.5 | 98 | 116 | 135 | 98 | 52.3 | 2 | 0 |
| 4.27 | 52 | 85.2 | 100 | 121 | 297 | 114 | 64.4 | 12 | 0 |
| 4.28 | 60 | 69.8 | 79.5 | 89.2 | 99 | 79.5 | 27.6 | 2 | 0 |
| 4.29 | 46 | 88.5 | 118 | 146 | 229 | 123 | 46.6 | 18 | 0 |
| 4.3 | 39 | 84.2 | 97.5 | 135 | 200 | 109 | 46.4 | 16 | 0 |
| 4.31 | 56 | 66 | 89 | 100 | 319 | 109 | 73.2 | 13 | 0 |
| 4.32 | 33 | 47.2 | 55 | 112 | 130 | 74.5 | 38.7 | 8 | 0 |
| 4.33 | 40 | 81.5 | 118 | 152 | 1.15e+03 | 139 | 120 | 124 | 0 |
| 4.34 | 45 | 55 | 60 | 68 | 139 | 73.4 | 37.6 | 5 | 0 |
| 4.35 | 45 | 85 | 147 | 195 | 275 | 148 | 76.4 | 13 | 0 |
| 4.36 | 70 | 99.8 | 117 | 150 | 202 | 123 | 35.9 | 16 | 0 |
| 4.37 | 66 | 85 | 99 | 100 | 125 | 94.3 | 19.2 | 7 | 0 |
| 4.38 | 35 | 108 | 130 | 171 | 252 | 137 | 50.8 | 28 | 0 |
| 4.39 | 59 | 73.5 | 130 | 176 | 8e+03 | 840 | 2.38e+03 | 11 | 0 |
| 4.4 | 22 | 79.8 | 124 | 168 | 740 | 143 | 108 | 52 | 0 |
| 4.41 | 33 | 85.5 | 125 | 160 | 270 | 133 | 66.2 | 15 | 0 |
| 4.42 | 42 | 89.8 | 125 | 148 | 250 | 127 | 53.7 | 22 | 0 |
| 4.43 | 34 | 100 | 122 | 161 | 430 | 135 | 65.5 | 48 | 0 |
| 4.44 | 58 | 94.2 | 115 | 149 | 283 | 128 | 54.7 | 24 | 0 |
| 4.45 | 38 | 105 | 143 | 206 | 410 | 166 | 93.3 | 25 | 0 |
| 4.46 | 46 | 86 | 130 | 186 | 250 | 141 | 65.8 | 22 | 0 |
| 4.47 | 44 | 88.8 | 102 | 158 | 500 | 148 | 111 | 24 | 0 |
| 4.48 | 65 | 105 | 130 | 158 | 329 | 154 | 81.6 | 18 | 0 |
| 4.49 | 58 | 80 | 90 | 118 | 188 | 99.8 | 36.1 | 11 | 0 |
| 4.5 | 29 | 90 | 120 | 160 | 2e+03 | 141 | 120 | 430 | 0 |
| 4.51 | 43 | 91.5 | 122 | 162 | 264 | 135 | 77.3 | 6 | 0 |
| 4.52 | 68 | 80 | 103 | 140 | 707 | 151 | 155 | 17 | 0 |
| 4.53 | 45 | 82 | 105 | 150 | 240 | 118 | 49.1 | 41 | 0 |
| 4.54 | 33 | 66 | 96 | 145 | 286 | 106 | 52.6 | 33 | 0 |
| 4.55 | 28 | 80 | 113 | 163 | 389 | 131 | 74.1 | 51 | 0 |
| 4.56 | 47 | 98.8 | 124 | 152 | 250 | 131 | 49.1 | 64 | 0 |
| 4.57 | 25 | 81.5 | 108 | 150 | 740 | 141 | 111 | 94 | 0 |
| 4.58 | 65 | 98.8 | 130 | 162 | 895 | 162 | 134 | 48 | 0 |
| 4.59 | 50 | 81.8 | 130 | 193 | 464 | 146 | 88 | 34 | 0 |
| 4.6 | 40 | 95 | 125 | 164 | 599 | 140 | 81.9 | 136 | 0 |
| 4.61 | 58 | 89.5 | 110 | 172 | 695 | 149 | 115 | 35 | 0 |
| 4.62 | 26 | 75 | 107 | 150 | 450 | 122 | 74.7 | 42 | 0 |
| 4.63 | 29 | 91.5 | 120 | 164 | 550 | 143 | 85.5 | 115 | 0 |
| 4.64 | 26 | 89.8 | 130 | 186 | 1.07e+03 | 166 | 149 | 68 | 0 |
| 4.65 | 32 | 91.2 | 133 | 188 | 367 | 147 | 82.5 | 42 | 0 |
| 4.66 | 30 | 63.5 | 109 | 180 | 399 | 131 | 90.3 | 23 | 0 |
| 4.67 | 17 | 90 | 120 | 164 | 599 | 138 | 75.7 | 435 | 0 |
| 4.68 | 40 | 98.5 | 125 | 163 | 800 | 154 | 110 | 63 | 0 |
| 4.69 | 30 | 90 | 115 | 168 | 450 | 140 | 77.7 | 88 | 0 |
| 4.7 | 35 | 99 | 130 | 172 | 350 | 140 | 65.3 | 95 | 0 |
| 4.71 | 42 | 99 | 120 | 153 | 400 | 135 | 59.1 | 165 | 0 |
| 4.72 | 33 | 84.5 | 117 | 144 | 300 | 128 | 65.1 | 56 | 0 |
| 4.73 | 30 | 85 | 120 | 170 | 800 | 144 | 101 | 124 | 0 |
| 4.74 | 30 | 83 | 120 | 166 | 500 | 141 | 86.4 | 71 | 0 |
| 4.75 | 37 | 95 | 129 | 180 | 829 | 148 | 87.4 | 360 | 0 |
| 4.76 | 34 | 122 | 145 | 180 | 500 | 158 | 78.7 | 73 | 0 |
| 4.77 | 50 | 95 | 125 | 162 | 729 | 146 | 92.8 | 99 | 0 |
| 4.78 | 25 | 96 | 130 | 190 | 1.16e+03 | 155 | 112 | 153 | 0 |
| 4.79 | 27 | 96 | 146 | 200 | 518 | 163 | 89.4 | 137 | 0 |
| 4.8 | 25 | 100 | 120 | 173 | 760 | 145 | 78.2 | 308 | 0 |
| 4.81 | 28 | 92 | 124 | 164 | 463 | 146 | 87.9 | 141 | 0 |
| 4.82 | 20 | 98.2 | 135 | 193 | 699 | 156 | 92.9 | 158 | 0 |
| 4.83 | 29 | 100 | 135 | 186 | 800 | 157 | 89.8 | 290 | 0 |
| 4.84 | 25 | 90.8 | 129 | 166 | 534 | 145 | 86.3 | 104 | 0 |
| 4.85 | 4 | 95 | 124 | 194 | 550 | 149 | 86.1 | 151 | 0 |
| 4.86 | 45 | 100 | 139 | 198 | 950 | 161 | 96 | 268 | 0 |
| 4.87 | 26 | 90 | 124 | 179 | 440 | 139 | 68 | 145 | 0 |
| 4.88 | 47 | 100 | 125 | 194 | 1.19e+03 | 158 | 101 | 298 | 0 |
| 4.89 | 38 | 96.5 | 139 | 180 | 550 | 156 | 90.6 | 251 | 0 |
| 4.9 | 31 | 100 | 131 | 186 | 500 | 154 | 77.4 | 216 | 0 |
| 4.91 | 27 | 100 | 140 | 190 | 500 | 157 | 79.7 | 197 | 0 |
| 4.92 | 30 | 100 | 135 | 185 | 900 | 160 | 97.5 | 242 | 0 |
| 4.93 | 50 | 109 | 150 | 200 | 2e+03 | 172 | 146 | 220 | 0 |
| 4.94 | 35 | 106 | 144 | 200 | 745 | 171 | 104 | 176 | 0 |
| 4.95 | 50 | 102 | 142 | 200 | 504 | 164 | 82.6 | 140 | 0 |
| 4.96 | 28 | 100 | 136 | 196 | 399 | 157 | 80.2 | 122 | 0 |
| 4.97 | 43 | 99 | 139 | 199 | 350 | 154 | 67.8 | 93 | 0 |
| 4.98 | 45 | 95 | 136 | 198 | 950 | 183 | 160 | 44 | 0 |
| 4.99 | 80 | 110 | 139 | 156 | 279 | 146 | 61.6 | 9 | 0 |
| 5 | 10 | 101 | 140 | 190 | 1.92e+03 | 163 | 104 | 3404 | 0 |
By drawing a regression line between the median price and the ratings, we concluded that they share a positive correlation, which means higher ratings have higher median price, and vice versa.
We will use two different methods for this, first we will use the median for each rating creating a smaller subset of datapoints:
##Create the data of price and score rating
median_per_rating <- listings2 %>%
#Group by review
group_by(review_scores_rating) %>%
#Summarize by median price
summarise(median_price = median(price)) %>%
#Exclude NA's
na.omit()
#Plotting
ggplot(median_per_rating,
#Review scores vs median price
aes(x = review_scores_rating,
y= median_price)) +
#Coloring red dots
geom_point(color="red") +
#Adding useful titles
labs(title = "Median price per night vs. Ratings",
x="Ratings",
y="Median Price"
) +
#Adding a trend line
geom_smooth(method=lm,colour="black",alpha=0)+
#Simple theme
theme_bw() +
ylim(0, 200) +
NULLWe see some relationship between price and ratings. Let’s explore this further and see if this depends on the room type.
#Plotting by rating but log of price
ggplot(listings2, aes(x = review_scores_rating, y = log(price), color = room_type)) +
#In red color
geom_point(alpha = 0.5) +
#And adding a trendline
geom_smooth(se = F, method = "lm", color = "black") +
#In minimal Theme
theme_bw() +
#Adding useful titles
labs(title = "Log price per night vs. Ratings",
x="Ratings",
y="Log Price"
) +
facet_wrap(~ room_type, scales = "free") +
guides(color=F) +
NULLInterestingly, especially hotel rooms are sensitive to review ratings.
Let’s first look at a key summary
#Analyse the distribution of price of listings depends on the rating score
favstats(price~host_is_superhost,data=listings2)| host_is_superhost | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| FALSE | 4 | 99 | 130 | 180 | 8e+03 | 154 | 126 | 9767 | 0 |
| TRUE | 27 | 88 | 127 | 180 | 900 | 150 | 92.6 | 1634 | 0 |
Interestingly, non-superhosts seem to have a higher price than superhosts. Let’s plot some relationships to explore this further.
#Create the data of price, score rating, and superhost
median_per_rating1 <- listings2 %>%
#grouping by review score
group_by(review_scores_rating) %>%
#Summarizing by price and superhost
summarise(median_price = median(price),
host_is_superhost=host_is_superhost) %>%
#Excluding NA's
na.omit()
#Plotting the dataset
ggplot(median_per_rating1,
#By same as above but including host is superhost
aes(x = review_scores_rating,
y= median_price,
colour = host_is_superhost)) +
#Adding scatterplot
geom_point() +
#Adding useful labels
labs(title = "Median price per night vs. Ratings",
x="Ratings",
y="Median Price"
) +
#Adding Trendline
geom_smooth(method=lm,colour="black",alpha=0)+
#Adding a simple theme
theme_bw() +
#Adding a legend
guides(color=guide_legend(title="Host is Superhost"))+
#Wrapping by host type to get two different charts
facet_wrap(~host_is_superhost) +
NULLAlthough both mean and median price for listings by non-superhost were higher than that by superhost, the regression analysis above shows that the correlation between ratings and median prices is higher for superhost. Therefore, the median price of listings by superhost is more affected by the change in ratings than that by non-superhost.
Now let’s also take a look at the effect from superhosts themselves:
#Plotting by host type
ggplot(listings2,
#Adding key variables
aes(x=host_is_superhost,
y=log(price),
fill = host_is_superhost))+
#Looking at a boxplot for distribution
geom_boxplot()+
#Adding useful labels
labs(title = "Superhost versus non superhosts price distributions",
x="Superhost",
y=" Log Price"
)+
#Removing legend
guides(fill = F) +
#Simple Theme
theme_bw() +
NULL With more extreme outliers, the box plot of non-superhosts shows higher standard deviation compared to that of superhost.Let’s confirm this:
#Plot the density graph
ggplot(listings2,
aes(x=log(price),
y = host_is_superhost,
fill = host_is_superhost))+
#Density Ridges Plit
geom_density_ridges(alpha=0.3) +
#Useful lables
labs(
title = "Prices by host type",
x = "Price per night (log)",
y = "Density") +
theme_classic() +
#Title of Legend
guides(fill=F)+
#Plotting a vertical line at the median
geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
NULLBased on the graph above, the distribution of superhost seems to have a greater variation. However, the distribution of superhost has a lower standard deviation than that of non-superhost. Therefore, we believe that the graph above is greatly influenced by the number of samples included, i.e. 1,634 superhost vs 9,767 non-superhost.
Similarly we can graph a relationship between number of reviews and price.
As this is a numerical variable with many ranges we will not use favstats to look into this.
ggplot(listings2,
#Plotting reviews by price
aes(x = number_of_reviews,
y= log(price))) +
#Adding color to the chart
geom_point(color="red", alpha = 0.3) +
#Adding useful labels
labs(title = "Price per night vs. number of Reviews",
x="Number of Reviews",
y="Log Price"
) +
#Adding a trendline
geom_smooth(method=lm,colour="black",alpha=0)+
#Adding BW Theme
theme_bw() +
NULLWe see that as the number of reviews increases the price falls.
Let’s check if this holds across categorical variables
ggplot(listings2,
#Plotting reviews by price
aes(x = number_of_reviews,
y= log(price),
color = room_type)) +
#Adding color to the chart
geom_point(alpha = 0.5) +
#Adding useful labels
labs(title = "Price per night vs. number of Reviews",
x="Number of Reviews",
y="Log Price"
) +
#Adding a trendline
geom_smooth(method=lm,colour="black",alpha=0)+
#Adding BW Theme
theme_bw() +
#Wrapping by host type to get two different charts
facet_wrap(~room_type, scales ="free") +
guides(color = F) +
NULL This is super interesting to see because we see that the relationship depends on room types. Therefore it is important to control for room type when looking at the effect of number of reviews!
Let’s first look at a summary of the data:
#Analyse the distribution of price of listings depends on the rating score
favstats(price~license,data=listings2)| license | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| no | 4 | 99 | 129 | 175 | 8e+03 | 151 | 130 | 8031 | 0 |
| yes | 25 | 96 | 136 | 195 | 1.19e+03 | 159 | 101 | 3370 | 0 |
By graphing a boxplot, we see that properties with license have higher Q1, median and Q3 value than properties without license. Hence, we conclude that properties with licence are more expensive than the others.
#Plotting the data
ggplot(listings2,
aes(x=license,
y=log(price)))+
#In Wheat Color
geom_boxplot(fill="wheat2")+
#Useful labels
labs(title = "Price per night vs. License",
x="License",
y=" Price"
)+
#BW Theme
theme_bw()+
NULL We see that there is a higher spread on prices of properties without licenses and a slightly lower median price.
#Analyse the distribution of price of listings depends on the rating score
favstats(price~room_type,data=listings2)| room_type | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Entire home/apt | 4 | 110 | 147 | 199 | 8e+03 | 169 | 128 | 8731 | 0 |
| Hotel room | 26 | 83 | 120 | 156 | 900 | 138 | 135 | 80 | 0 |
| Private room | 17 | 65 | 85 | 115 | 2e+03 | 99.5 | 74 | 2557 | 0 |
| Shared room | 20 | 37 | 65 | 82 | 336 | 86 | 71.5 | 33 | 0 |
There seems to be an extremely strong relationship between room type and price. This is good for our analysis but let’s first plot it further.
First let’s visualize the median price:
#Calculate the median price of different room types
median_per_room <- listings2 %>%
#By Room
group_by(room_type) %>%
#In median Prices
summarise(median_price = median(price))
#Plot the graph
ggplot(median_per_room,
#Reorder by price
aes(x = reorder(room_type, median_price),
#Price on Y Axis
y= median_price)) +
#Column Chart
geom_col(fill = "skyblue") +
#Useful Labels
labs(
title = "Median price by room_type",
x="Room Type",
y="Median Price"
) +
#Flipping for aesthetics
coord_flip() +
#Simple theme for aesthetics
theme_bw() +
NULL We see that entire homes and hotel rooms are especially expensive. We would like to plot the density of this to explore it further:
#Density Plot
ggplot(listings2,
aes(x=log(price),
y = room_type,
fill = room_type))+
#Density Ridges Plit
geom_density_ridges(alpha=0.3) +
#Useful lables
labs(
title = "Prices by room type",
x = "Price per night (log)",
y = "Density") +
theme_classic() +
#Title of Legend
guides(fill=F)+
#Plotting a vertical line at the median
geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
NULL Interestingly although entire homes or apartments are the most expensive in terms of median, we see that some hotels are also extremely expensive.
We also see that these distributions are multimodal - i.e. prices are clustered around specific levels.
#Using above dataset for the confidence interval calculations
formula_ci <- listings2%>%
group_by(room_type) %>%
#Calculate weight's summary statistics for people exercising at least 3 times a week
# calculate mean, SD, count, SE, lower/upper 95% CI
summarise(
average_price=mean(price,na.rm=TRUE), #Mean, we choose to ignore any missing values by setting the 'na.rm = TRUE'
sd_price=sd(price,na.rm=TRUE), #Standard Deviation
count= n(), #Observations
t_critical = qt(0.975,count-1), #T-Critical at 95% Confidence Interval and these observations
se_price=sd_price/sqrt(count), #Standard Error
margin_of_error= t_critical*se_price, #Margin of Error
price_low= average_price - margin_of_error, #Lower interval
price_high= average_price + margin_of_error) #Upper Interval
formula_ci| room_type | average_price | sd_price | count | t_critical | se_price | margin_of_error | price_low | price_high |
|---|---|---|---|---|---|---|---|---|
| Entire home/apt | 169 | 128 | 8731 | 1.96 | 1.37 | 2.69 | 167 | 172 |
| Hotel room | 138 | 135 | 80 | 1.99 | 15.1 | 30 | 108 | 168 |
| Private room | 99.5 | 74 | 2557 | 1.96 | 1.46 | 2.87 | 96.6 | 102 |
| Shared room | 86 | 71.5 | 33 | 2.04 | 12.4 | 25.4 | 60.6 | 111 |
ggplot(formula_ci, aes(x=average_price, y=room_type, color=room_type)) +
#geom_errorbar function allows us to show the two bars with confidence intervals
geom_errorbar(aes(xmin=price_low, xmax=price_high),width = 0.1, size=0.5)+
geom_point(aes(x=average_price),size=1)+
theme_bw() +
theme(legend.position = "none",axis.title.y=element_blank())+
labs(title = "Confidence interval for average price per room type",
subtitle = "95% confidence intervals overlap",
x = "Average Price"
) +
NULL The confidence intervals’ lengths are very different between each room type. Shared and hotel rooms have wide widths. There is a big uncertainty related to their average price. On the other side, we can estimate quite accurately the entire and private homes’ average.
Let us look into the statistics first:
#Analyse the distribution of price of listings depends on the rating score
favstats(price~room_type,data=listings2)| room_type | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Entire home/apt | 4 | 110 | 147 | 199 | 8e+03 | 169 | 128 | 8731 | 0 |
| Hotel room | 26 | 83 | 120 | 156 | 900 | 138 | 135 | 80 | 0 |
| Private room | 17 | 65 | 85 | 115 | 2e+03 | 99.5 | 74 | 2557 | 0 |
| Shared room | 20 | 37 | 65 | 82 | 336 | 86 | 71.5 | 33 | 0 |
There seems to be a very strong relationship between property type and price. Lets plot the medians again:
#Calculate the median price of different room types
median_per_prop <- listings2 %>%
group_by(prop_type_simplified) %>%
summarise(median_price = median(price))
#Plot the graph
ggplot(median_per_prop,
aes(x = reorder(prop_type_simplified, median_price),
y= median_price)) +
#Column type
geom_col(fill = "skyblue") +
#Useful labels
labs(
title = "Median price by property type",
x="Room Type",
y="Median Price"
) +
#Flipping for aesthetics
coord_flip() +
#Simple theme
theme_bw() +
NULLThis is very useful but let’s also take a look at the distribution:
#Density Plot
ggplot(listings2,
aes(x=log(price),
y = prop_type_simplified,
fill = prop_type_simplified))+
#Density Ridges Plit
geom_density_ridges(alpha=0.3) +
#Useful lables
labs(
title = "Prices by property type",
x = "Price per night (log)",
y = "Density") +
theme_classic() +
#Title of Legend
guides(fill=F)+
#Plotting a vertical line at the median
geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
NULL There is a very distinct pattern between property types and this will likely be a useful regressor for our analysis.
Let’s also look at availability next 30 days which we hypothesize will be an indicator of popularity.
#Summarizing median price by availability
listings_availability<-listings2 %>%
group_by(availability_30) %>%
summarize(price=median(price))
#Plotting scatterplot of availability
ggplot(listings_availability,
aes(x = availability_30,
y= log(price))) +
#Points in red
geom_point(color="red") +
#Useful Labels
labs(title = "Log Price Per Night versus Availability",
x="Availability next 30 days",
y="Log Price"
) +
#Adding a trendline
geom_smooth(method=lm,colour="black",alpha=0)+
#Simplified theme
theme_bw() +
NULL The more availability, the higher the price, which indicates that more premium rooms are booked less often or are less often fully booked.
This goes against our initial thought that less availability would be popular rooms and therefore more expensive! Good that we looked into this.
Let’s also look at whether or not units are instantly bookable
#Analyse the distribution of price of listings depends on the rating score
favstats(price~instant_bookable,data=listings2)| instant_bookable | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| FALSE | 4 | 100 | 130 | 180 | 2e+03 | 154 | 94.7 | 8869 | 0 |
| TRUE | 25 | 89 | 125 | 175 | 8e+03 | 150 | 188 | 2532 | 0 |
Instant bookable seems to be slightly higher distributed. Let’s take a look at the graph:
#Boxplotting instant bookable vs log prices
ggplot(listings2,aes(x=instant_bookable,y=log(price)))+
#Boxplot format
geom_boxplot(fill="wheat2")+
#Useful labels
labs(title = "Distribution of prices depending on whether or not units are instantly bookable",
x="Instantly Bookable",
y="Log Price"
)+
#Simople Theme
theme_bw() There seems to be little to no difference between whether or not the units are bookable, however, we will still look further into this in the regression.
Let’s also look at when they have been host since:
#Plotting scatterplot of availability
ggplot(listings2,
aes(x = host_since,
y= log(price))) +
#Points in red
geom_point(color="red", alpha = 0.2) +
#Useful Labels
labs(title = "Log price per night versus how long the host has been active on Airbnb",
x="Host since (date)",
y="Log Price"
) +
#Adding a trendline
geom_smooth(method=lm,colour="black",alpha=0)+
#Simplified theme
theme_bw() +
NULLThere seems to be little to no relationship here and we will likely not need to consider it going forward.
Let’s check across categoris
#Plotting scatterplot of availability
ggplot(listings2,
aes(x = host_since,
y= log(price),
color = prop_type_simplified)) +
#Points in red
geom_point(alpha = 0.5) +
#Useful Labels
labs(title = "Log price per night versus how long the host has been active on Airbnb",
x="Host since (date)",
y="Log Price"
) +
#Adding a trendline
geom_smooth(method=lm,colour="black",alpha=0)+
#Simplified theme
theme_bw() +
facet_wrap(~prop_type_simplified, scales = "free") +
guides(color = F) +
NULLThis seems to hold true across a variety of categories and we will not consider it going forward.
Let’s also look at whether last review date has an impact
#Plotting scatterplot of availability
ggplot(listings2,
aes(x = last_review,
y= log(price))) +
#Points in red
geom_point(color="red", alpha = 0.2) +
#Useful Labels
labs(title = "Log price per night versus when the last review was",
x="Last Review (date)",
y="Log Price"
) +
#Adding a trendline
geom_smooth(method=lm,colour="black",alpha=0)+
#Simplified theme
theme_bw() +
NULLWe see a slight relationship between price and the date of last review. We also noticed that some properties do not have a recent review probably due to covid-19 outbreak.
We will plot the density for reference
#Plotting scatterplot of availability
ggplot(listings2,
aes(x = last_review)) +
geom_density(fill = "blue", alpha = 0.2) +
#Useful Labels
labs(title = "Distributions of last reviews",
x="Last Review (date)",
) +
#Simplified theme
theme_bw() +
NULLAlthough many properties did not receive a review during the Covid-19 outbreak, we will still include these data in our analysis.
#Plotting scatterplot of availability
ggplot(listings2,
aes(x = last_review,
y= log(price),
color = prop_type_simplified)) +
#Points in red
geom_point(alpha = 0.5) +
#Useful Labels
labs(title = "Log price per night versus when the last review was",
x="Last Review (date)",
y="Log Price"
) +
#Adding a trendline
geom_smooth(method=lm,colour="black",alpha=0)+
facet_wrap(~prop_type_simplified, scales = "free", ncol = 3) +
#Simplified theme
theme_bw() +
guides(color =F) +
NULL We see that this relationship holds across a variety of properties.
We plotted a GG pairs plot to get a comprehensive overview of the relationship between the price of listings and the selected variables.
We only plot the numerical variables where we can get a correlation and scatterplot of the relationship
#Relatonship between the price and variables, starting with numerical variables
listings2 %>%
#Choosing our variables
select(price, #Variable we are trying to explain
accommodates,
review_scores_rating,
number_of_reviews,
minimum_nights,
reviews_per_month,
bedrooms,
beds,
availability_30,
last_review) %>%
#Size and aesthetics
ggpairs(aes(alpha=0.2))+
theme_minimal(base_size=8) From the above we see that for price the most correlated variables is the number of accomodates or bedrooms (although these are correlated with 0.73 and will likely not both be included). However, from here it seems that perhaps it is better to focus on accommodates rather than bedrooms.
It is worth noting that some of the less correlated variables seem to be review scores, number of reviews, however, we will still try and test these in our regression.
Unfortunately, GGPairs does not work well with categorical variables and we will use that data we extracted above as an indicator for whether or not these variables will be useful going forward. `
For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights. We first create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
We assume that this leaves us with Airbnbs accommodating 2 or more people. I.e. we filter out Airbnb’s which accomodate 1 person.
Also, we assume that no other people would go to the airbnb lowering the cost per person (i.e. 2 people could rent a 10 people Airbnb but would be the only two to pay).
Lets create the cost for two people and filter for the apartments that have 2 people min:
##Fist, we have to filter the data to represent 2 people staying for 4 nights.
listings_4 <- listings2 %>%
filter(accommodates>=2) %>%
mutate(price_4_nights =price*4)We take logarithm of ’price_4_nights‘ and create the density plot, because it looks more like the normal distribution. See below:
# Plot density of price using price_4_nights
#Using price 4 nights without log
ggplot(listings_4,aes(x=price_4_nights))+
#In a density plot
geom_density() +
#Useful labels
labs(
title = "Density plot for prices for 4 nights",
x = "price per night (log)",
y = "Density") +
#Simple Theme
theme_classic() +
NULL# Plot density of price using log(price_4_nights)
ggplot(listings_4,aes(x=log(price_4_nights)))+
#In density Plot
geom_density() +
#usfeul Labs
labs(
title = "Density plot for prices for 4 nights",
x = "price per night (log)",
y = "Density") +
#Sumple Theme
theme_classic() +
NULLWe see that taking the log of prices brings us much closer to a normal distribution why we use log prices going forward.
This is the same conclusion we made above here we just plotted the distribution for 4 nights and after adjusting our dataset throughout the above code.
We first fit a regression model with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
First, let us split the data into a training and testing set. The training will be used for the models and the testing later on:
#Set seed for regeneration purposes
set.seed(123456)
#Splitting 75% into a training set
train_test_split <- initial_split(listings_4, prop = 0.75)
#New variable name for training set
listings_train <- training(train_test_split)
#New Variable name for test set
listings_test <- testing(train_test_split)Lets now create the model using the initial variables:
#Create the model1
model1 <- lm(log(price_4_nights) ~ #Predicted Variable
#Explanatory Variables
prop_type_simplified +
number_of_reviews +
review_scores_rating,
#Dataset
data=listings_train)
#Summary of Model 1
summary(model1)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.5932 -0.2978 -0.0354 0.2593 4.0634
Coefficients:
Estimate Std. Error t value
(Intercept) 5.836e+00 7.097e-02 82.232
prop_type_simplifiedEntire residential home 2.252e-01 2.056e-02 10.958
prop_type_simplifiedEntire townhouse 2.923e-01 3.130e-02 9.339
prop_type_simplifiedOther -6.992e-02 1.362e-02 -5.135
prop_type_simplifiedPrivate room in rental unit -5.324e-01 1.654e-02 -32.188
number_of_reviews -5.425e-04 8.414e-05 -6.447
review_scores_rating 1.107e-01 1.480e-02 7.483
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home < 2e-16 ***
prop_type_simplifiedEntire townhouse < 2e-16 ***
prop_type_simplifiedOther 2.88e-07 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
number_of_reviews 1.20e-10 ***
review_scores_rating 8.03e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4467 on 8356 degrees of freedom
Multiple R-squared: 0.1707, Adjusted R-squared: 0.1701
F-statistic: 286.6 on 6 and 8356 DF, p-value: < 2.2e-16
We come to an adjusted R squared of 0.1707 which is a good start. We see that all variables have significant t-values indicating they are significantly different from 0.
More specifically these values changes as the category changes from entire rental unit (the variable excluded)
These numbers are compared to if the unit is entire rental Unit which is the variable excluded (Categorical Variables exclude one variable)
For review scores rating we see that as rating goes up 1 unit log of price goes up 0.019.
However, as number of reviews goes up price goes down. For every 100 additional reviews price falls roughly 0.05 (Likely because those properties are less premium).
We recall from earlier that reviews should account for room type which is included now:
Now lets add the room type
#Create the model2
model2 <- lm(log(price_4_nights) ~ #Predicted Variable
#Explanatory Variables
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
#Dataset
data=listings_train)
#Summary of Model 2
summary(model2)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating + room_type, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.5898 -0.2870 -0.0362 0.2469 4.0502
Coefficients:
Estimate Std. Error t value
(Intercept) 5.970e+00 6.894e-02 86.594
prop_type_simplifiedEntire residential home 2.269e-01 1.990e-02 11.405
prop_type_simplifiedEntire townhouse 2.953e-01 3.030e-02 9.745
prop_type_simplifiedOther 2.389e-01 1.871e-02 12.768
prop_type_simplifiedPrivate room in rental unit 1.185e-02 2.810e-02 0.422
number_of_reviews -1.949e-04 8.281e-05 -2.354
review_scores_rating 8.157e-02 1.438e-02 5.673
room_typeHotel room -2.579e-01 6.379e-02 -4.043
room_typePrivate room -5.622e-01 2.385e-02 -23.569
room_typeShared room -6.104e-01 1.064e-01 -5.737
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home < 2e-16 ***
prop_type_simplifiedEntire townhouse < 2e-16 ***
prop_type_simplifiedOther < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 0.6732
number_of_reviews 0.0186 *
review_scores_rating 1.45e-08 ***
room_typeHotel room 5.33e-05 ***
room_typePrivate room < 2e-16 ***
room_typeShared room 9.96e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4324 on 8353 degrees of freedom
Multiple R-squared: 0.2231, Adjusted R-squared: 0.2223
F-statistic: 266.6 on 9 and 8353 DF, p-value: < 2.2e-16
We see that hotel rooms, private rooms and shared rooms all lower the price compared to entire home/apt.
Our model has higher Adjusted Rsquared but some variables are no longer significant.
Lets now test for VIF given our model has some p-value issues:
car::vif(model2) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.162167 4 1.195129
number_of_reviews 1.141835 1 1.068567
review_scores_rating 1.020372 1 1.010135
room_type 4.292148 3 1.274811
Let’s keep the variables in given they are below 5 but there is some overlap between property type and room type.
We may have to account for this later on.
Also we earlier thought about accounting for number of reviews across properties types (e.g. by including the product of the two) but we will look into this relationship later once we confirm whether or not to keep property type as a variable.
We want to further determine whether the number ofbedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights. Lets add it to our model.
We keep in mind that these might be correlated and will test VIF after.
#New model
model3_1 <- lm(
#Predicted Variable
log(price_4_nights) ~
#Explanatory Variables
prop_type_simplified+
number_of_reviews+
review_scores_rating +
room_type +
beds +
bedrooms +
accommodates,
#dataset
data = listings_train)
summary(model3_1)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating + room_type + beds + bedrooms + accommodates,
data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.4339 -0.2347 -0.0146 0.2135 4.0939
Coefficients:
Estimate Std. Error t value
(Intercept) 5.308e+00 6.297e-02 84.295
prop_type_simplifiedEntire residential home -7.360e-02 1.903e-02 -3.868
prop_type_simplifiedEntire townhouse 5.799e-03 2.790e-02 0.208
prop_type_simplifiedOther 1.478e-01 1.669e-02 8.855
prop_type_simplifiedPrivate room in rental unit -3.664e-02 2.490e-02 -1.472
number_of_reviews -2.342e-04 7.353e-05 -3.186
review_scores_rating 1.106e-01 1.278e-02 8.652
room_typeHotel room -7.229e-02 5.660e-02 -1.277
room_typePrivate room -3.881e-01 2.144e-02 -18.106
room_typeShared room -4.355e-01 9.433e-02 -4.616
beds 9.345e-03 5.935e-03 1.574
bedrooms 1.234e-01 9.302e-03 13.272
accommodates 1.165e-01 6.016e-03 19.368
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home 0.00011 ***
prop_type_simplifiedEntire townhouse 0.83537
prop_type_simplifiedOther < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 0.14117
number_of_reviews 0.00145 **
review_scores_rating < 2e-16 ***
room_typeHotel room 0.20157
room_typePrivate room < 2e-16 ***
room_typeShared room 3.97e-06 ***
beds 0.11541
bedrooms < 2e-16 ***
accommodates < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3828 on 8350 degrees of freedom
Multiple R-squared: 0.3914, Adjusted R-squared: 0.3906
F-statistic: 447.6 on 12 and 8350 DF, p-value: < 2.2e-16
We notice that beds are not significant and will take that out.
#New model
model3_2 <- lm(
#Predicted Variable
log(price_4_nights) ~
#Explanatory Variables
prop_type_simplified+
number_of_reviews+
review_scores_rating +
room_type +
bedrooms +
accommodates,
#dataset
data = listings_train)
summary(model3_2)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating + room_type + bedrooms + accommodates,
data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.4334 -0.2340 -0.0144 0.2132 4.0992
Coefficients:
Estimate Std. Error t value
(Intercept) 5.302e+00 6.285e-02 84.349
prop_type_simplifiedEntire residential home -7.151e-02 1.898e-02 -3.767
prop_type_simplifiedEntire townhouse 8.377e-03 2.786e-02 0.301
prop_type_simplifiedOther 1.489e-01 1.667e-02 8.929
prop_type_simplifiedPrivate room in rental unit -3.620e-02 2.490e-02 -1.454
number_of_reviews -2.283e-04 7.344e-05 -3.108
review_scores_rating 1.105e-01 1.278e-02 8.648
room_typeHotel room -7.243e-02 5.661e-02 -1.279
room_typePrivate room -3.874e-01 2.143e-02 -18.074
room_typeShared room -4.307e-01 9.429e-02 -4.568
bedrooms 1.291e-01 8.582e-03 15.042
accommodates 1.212e-01 5.245e-03 23.100
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home 0.000166 ***
prop_type_simplifiedEntire townhouse 0.763655
prop_type_simplifiedOther < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 0.145961
number_of_reviews 0.001889 **
review_scores_rating < 2e-16 ***
room_typeHotel room 0.200763
room_typePrivate room < 2e-16 ***
room_typeShared room 5e-06 ***
bedrooms < 2e-16 ***
accommodates < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3828 on 8351 degrees of freedom
Multiple R-squared: 0.3912, Adjusted R-squared: 0.3904
F-statistic: 487.9 on 11 and 8351 DF, p-value: < 2.2e-16
We also notice that some variables are no longer significant.
Let’s check VIF
vif(model3_2) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 5.152846 4 1.227456
number_of_reviews 1.145956 1 1.070493
review_scores_rating 1.028391 1 1.014096
room_type 4.429142 3 1.281504
bedrooms 2.473767 1 1.572821
accommodates 2.182146 1 1.477209
We now see quite a high variance inflation factor for Prop_type_simplified, and Room Type. This makes sense as the room type (i.e. hotel room) is related to property type (i.e. a hotel)
Let’s try and run and see each model without one another.
#Same model but without Property Type
model3_3 <- lm(
log(price_4_nights) ~
#Explanatory Variables
number_of_reviews+
review_scores_rating +
room_type+
bedrooms +
accommodates,
data = listings_train)
#Same model but without Room Type
model3_4 <- lm(
log(price_4_nights) ~
#Explanatory Variables
prop_type_simplified+
number_of_reviews+
review_scores_rating +
bedrooms +
accommodates,
data = listings_train)
vif(model3_3) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.139493 1 1.067471
review_scores_rating 1.026732 1 1.013278
room_type 1.229365 3 1.035015
bedrooms 2.207683 1 1.485827
accommodates 2.163928 1 1.471029
vif(model3_4) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.430238 4 1.045746
number_of_reviews 1.109832 1 1.053486
review_scores_rating 1.019848 1 1.009875
bedrooms 2.458383 1 1.567923
accommodates 2.173088 1 1.474140
We immidieatly lower our variance inflation factor.
Now let’s try and decide which model is stronger going forward.
summary(model3_3)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.4421 -0.2413 -0.0190 0.2125 4.0866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.2844695 0.0634856 83.239 < 2e-16 ***
number_of_reviews -0.0001524 0.0000741 -2.057 0.03968 *
review_scores_rating 0.1155040 0.0129204 8.940 < 2e-16 ***
room_typeHotel room 0.0628398 0.0552016 1.138 0.25500
room_typePrivate room -0.3499489 0.0116290 -30.093 < 2e-16 ***
room_typeShared room -0.2964919 0.0941722 -3.148 0.00165 **
bedrooms 0.1187920 0.0082029 14.482 < 2e-16 ***
accommodates 0.1266656 0.0052843 23.970 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3873 on 8355 degrees of freedom
Multiple R-squared: 0.3765, Adjusted R-squared: 0.376
F-statistic: 720.9 on 7 and 8355 DF, p-value: < 2.2e-16
summary(model3_4)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating + bedrooms + accommodates, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.4258 -0.2358 -0.0135 0.2199 4.1128
Coefficients:
Estimate Std. Error t value
(Intercept) 5.174e+00 6.369e-02 81.235
prop_type_simplifiedEntire residential home -9.289e-02 1.932e-02 -4.808
prop_type_simplifiedEntire townhouse -1.365e-02 2.838e-02 -0.481
prop_type_simplifiedOther -5.977e-02 1.191e-02 -5.017
prop_type_simplifiedPrivate room in rental unit -4.037e-01 1.468e-02 -27.502
number_of_reviews -4.579e-04 7.372e-05 -6.212
review_scores_rating 1.313e-01 1.298e-02 10.115
bedrooms 1.409e-01 8.726e-03 16.141
accommodates 1.267e-01 5.338e-03 23.737
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home 1.55e-06 ***
prop_type_simplifiedEntire townhouse 0.631
prop_type_simplifiedOther 5.36e-07 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
number_of_reviews 5.48e-10 ***
review_scores_rating < 2e-16 ***
bedrooms < 2e-16 ***
accommodates < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3905 on 8354 degrees of freedom
Multiple R-squared: 0.3665, Adjusted R-squared: 0.3659
F-statistic: 604 on 8 and 8354 DF, p-value: < 2.2e-16
We see that model 3_2 (incl. room type) provides better explanatory power than 3_3 (incl. property type), and will utilize that going forward.
This model has an adjusted r square of 0.3659 and adds variables beds, bedrooms and accommodates.
We earlier commented that the relationship of number of ratings depends on room type so we analyze a model looking into the product of the two:
#Same model but without Property Type
model3_3_product <- lm(
log(price_4_nights) ~
#Explanatory Variables
number_of_reviews+
review_scores_rating +
room_type+
bedrooms +
accommodates +
room_type*number_of_reviews,
data = listings_train)
#Same model but without Room Type
summary(model3_3_product)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates + room_type * number_of_reviews,
data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.4407 -0.2371 -0.0164 0.2117 4.0862
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.2747982 0.0635659 82.982 <2e-16
number_of_reviews 0.0002605 0.0001480 1.761 0.0783
review_scores_rating 0.1162390 0.0129210 8.996 <2e-16
room_typeHotel room 0.1253254 0.0628741 1.993 0.0463
room_typePrivate room -0.3346790 0.0128508 -26.043 <2e-16
room_typeShared room -0.2845573 0.1142397 -2.491 0.0128
bedrooms 0.1210258 0.0082253 14.714 <2e-16
accommodates 0.1250238 0.0053038 23.573 <2e-16
number_of_reviews:room_typeHotel room -0.0011738 0.0004459 -2.632 0.0085
number_of_reviews:room_typePrivate room -0.0005238 0.0001718 -3.049 0.0023
number_of_reviews:room_typeShared room -0.0004791 0.0011559 -0.415 0.6785
(Intercept) ***
number_of_reviews .
review_scores_rating ***
room_typeHotel room *
room_typePrivate room ***
room_typeShared room *
bedrooms ***
accommodates ***
number_of_reviews:room_typeHotel room **
number_of_reviews:room_typePrivate room **
number_of_reviews:room_typeShared room
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3871 on 8352 degrees of freedom
Multiple R-squared: 0.3775, Adjusted R-squared: 0.3767
F-statistic: 506.5 on 10 and 8352 DF, p-value: < 2.2e-16
This input does not offer addiitonal explanatory power and includes 3 additional variables so we will not consider it going forward.
We want to explore whether the superhosts (host_is_superhost) command a pricing premium, after controlling for other variables.
Let us build this model on model 3.2
#Model 3.2 incl. superhost
model4 <- lm(
log(price_4_nights) ~
#Explanatory Variables
number_of_reviews+
review_scores_rating +
room_type+
bedrooms +
accommodates+
host_is_superhost,
data = listings_train)
summary(model4)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates + host_is_superhost,
data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.4313 -0.2353 -0.0155 0.2141 4.0933
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.369e+00 6.382e-02 84.134 < 2e-16 ***
number_of_reviews -3.479e-04 7.665e-05 -4.538 5.76e-06 ***
review_scores_rating 9.681e-02 1.301e-02 7.441 1.10e-13 ***
room_typeHotel room 1.667e-02 5.514e-02 0.302 0.76237
room_typePrivate room -3.705e-01 1.178e-02 -31.455 < 2e-16 ***
room_typeShared room -2.850e-01 9.370e-02 -3.041 0.00236 **
bedrooms 1.220e-01 8.168e-03 14.931 < 2e-16 ***
accommodates 1.239e-01 5.266e-03 23.523 < 2e-16 ***
host_is_superhostTRUE 1.239e-01 1.332e-02 9.303 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3854 on 8354 degrees of freedom
Multiple R-squared: 0.3829, Adjusted R-squared: 0.3823
F-statistic: 648 on 8 and 8354 DF, p-value: < 2.2e-16
The variable is significant however, it only has very little explanatory power.
A superhost adds 0.123 to the log price.
Let’s try and compute the VIF once again to check out model
vif(model4) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.232011 1 1.109960
review_scores_rating 1.051819 1 1.025582
room_type 1.282096 3 1.042286
bedrooms 2.211537 1 1.487124
accommodates 2.170987 1 1.473427
host_is_superhost 1.210964 1 1.100438
We see that VIF is below 5 for all variables meaning the variables do not have too high correlation
instant_bookable a significant predictor of price_4_nights#Model 4.0 incl. immidiate bookings
model5 <- lm(
log(price_4_nights) ~
#Explanatory Variables
number_of_reviews+
review_scores_rating +
room_type+
bedrooms +
accommodates+
host_is_superhost+
instant_bookable,
data = listings_train)
summary(model5)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates + host_is_superhost +
instant_bookable, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.4236 -0.2342 -0.0153 0.2152 4.0506
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.334e+00 6.409e-02 83.227 < 2e-16 ***
number_of_reviews -3.623e-04 7.659e-05 -4.730 2.28e-06 ***
review_scores_rating 1.028e-01 1.304e-02 7.881 3.65e-15 ***
room_typeHotel room -1.783e-02 5.547e-02 -0.322 0.74783
room_typePrivate room -3.811e-01 1.194e-02 -31.919 < 2e-16 ***
room_typeShared room -2.811e-01 9.356e-02 -3.005 0.00267 **
bedrooms 1.245e-01 8.171e-03 15.235 < 2e-16 ***
accommodates 1.220e-01 5.270e-03 23.154 < 2e-16 ***
host_is_superhostTRUE 1.180e-01 1.335e-02 8.840 < 2e-16 ***
instant_bookableTRUE 5.513e-02 1.071e-02 5.149 2.68e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3848 on 8353 degrees of freedom
Multiple R-squared: 0.3849, Adjusted R-squared: 0.3842
F-statistic: 580.7 on 9 and 8353 DF, p-value: < 2.2e-16
Here we see instant bookable is a significant predictor which will can be used in our model.
If a hotel is instantly bookable log price goes up by 0.055
We once again control VIF
vif(model5) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.233658 1 1.110702
review_scores_rating 1.060229 1 1.029674
room_type 1.336159 3 1.049485
bedrooms 2.219462 1 1.489786
accommodates 2.181078 1 1.476847
host_is_superhost 1.219980 1 1.104527
instant_bookable 1.101576 1 1.049560
Still these factors are not correlated enough for us to be worried.
For this section we have not chosen the neighbourhood_overview variables as it had a lot of NAs. We use the categories we set up earlier which split neighbourhoods into the most costly and least costly area.
We therefore expect it to have high explanatory power.
#Model 5 + neighbourhoods
model6 <- lm(
#Variable to predict
log(price_4_nights) ~
#Explanatory Variables
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
accommodates +
host_is_superhost +
instant_bookable +
neighbourhood_cleansed,
#Dataset
data = listings_train)
summary(model6)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates + host_is_superhost +
instant_bookable + neighbourhood_cleansed, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.5398 -0.2242 -0.0157 0.2040 4.0793
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.048e+00 6.347e-02 79.527 < 2e-16 ***
number_of_reviews -4.719e-04 7.316e-05 -6.450 1.18e-10 ***
review_scores_rating 1.101e-01 1.244e-02 8.845 < 2e-16 ***
room_typeHotel room -2.245e-03 5.295e-02 -0.042 0.96618
room_typePrivate room -3.705e-01 1.151e-02 -32.186 < 2e-16 ***
room_typeShared room -2.567e-01 8.927e-02 -2.875 0.00405 **
bedrooms 1.329e-01 7.804e-03 17.032 < 2e-16 ***
accommodates 1.193e-01 5.029e-03 23.728 < 2e-16 ***
host_is_superhostTRUE 1.099e-01 1.274e-02 8.632 < 2e-16 ***
instant_bookableTRUE 5.557e-02 1.022e-02 5.435 5.62e-08 ***
neighbourhood_cleansedTop 11-15 8.619e-02 1.936e-02 4.453 8.60e-06 ***
neighbourhood_cleansedTop 5 3.658e-01 1.739e-02 21.034 < 2e-16 ***
neighbourhood_cleansedTop 6-10 2.274e-01 1.757e-02 12.939 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.367 on 8350 degrees of freedom
Multiple R-squared: 0.4406, Adjusted R-squared: 0.4398
F-statistic: 548.1 on 12 and 8350 DF, p-value: < 2.2e-16
We see that neighbourgood categorization is a really impactful driver on our prices as our adjusted r square went to ca. 44%.
We see that compared to the bottom 7 neighbourhoods the prices are impacted as such:
Testing VIF
vif(model6) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.237386 1 1.112378
review_scores_rating 1.060844 1 1.029973
room_type 1.367241 3 1.053515
bedrooms 2.225889 1 1.491941
accommodates 2.182781 1 1.477424
host_is_superhost 1.220813 1 1.104904
instant_bookable 1.103987 1 1.050708
neighbourhood_cleansed 1.043262 3 1.007084
Still VIF looks fine.
Next lets look at availability in the coming 30 days
#Model 6 + 30 day availability
model7 <- lm(
#Variable to predict
log(price_4_nights) ~
#Explanatory Variables
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
accommodates +
host_is_superhost +
instant_bookable +
neighbourhood_cleansed+
availability_30,
#Dataset
data = listings_train)
summary(model7)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates + host_is_superhost +
instant_bookable + neighbourhood_cleansed + availability_30,
data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.5106 -0.2170 -0.0067 0.1992 4.1299
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.029e+00 6.070e-02 82.851 < 2e-16 ***
number_of_reviews -6.802e-04 7.035e-05 -9.668 < 2e-16 ***
review_scores_rating 1.103e-01 1.190e-02 9.268 < 2e-16 ***
room_typeHotel room -9.324e-02 5.073e-02 -1.838 0.06612 .
room_typePrivate room -4.307e-01 1.122e-02 -38.403 < 2e-16 ***
room_typeShared room -2.978e-01 8.538e-02 -3.488 0.00049 ***
bedrooms 1.453e-01 7.476e-03 19.437 < 2e-16 ***
accommodates 1.061e-01 4.832e-03 21.952 < 2e-16 ***
host_is_superhostTRUE 8.877e-02 1.220e-02 7.274 3.80e-13 ***
instant_bookableTRUE 3.716e-02 9.798e-03 3.793 0.00015 ***
neighbourhood_cleansedTop 11-15 9.902e-02 1.852e-02 5.348 9.15e-08 ***
neighbourhood_cleansedTop 5 3.715e-01 1.663e-02 22.338 < 2e-16 ***
neighbourhood_cleansedTop 6-10 2.453e-01 1.682e-02 14.590 < 2e-16 ***
availability_30 1.385e-02 4.950e-04 27.979 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3509 on 8349 degrees of freedom
Multiple R-squared: 0.4886, Adjusted R-squared: 0.4878
F-statistic: 613.5 on 13 and 8349 DF, p-value: < 2.2e-16
Once again, this variable is a significant explanatory variable. As availability over the next 30 days goes up with 1 log prices increase by 0.014
Let’s test for VIF once again:
vif(model7) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.251401 1 1.118660
review_scores_rating 1.060844 1 1.029973
room_type 1.422204 3 1.060458
bedrooms 2.233719 1 1.494563
accommodates 2.203964 1 1.484575
host_is_superhost 1.225527 1 1.107035
instant_bookable 1.108985 1 1.053084
neighbourhood_cleansed 1.046609 3 1.007621
availability_30 1.141229 1 1.068283
All VIFS are low and therefore we make no changes.
#Model 7 + License
model8 <- lm(
#Variable to predict
log(price_4_nights) ~
#Explanatory Variables
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
accommodates +
host_is_superhost +
instant_bookable +
neighbourhood_cleansed+
availability_30 +
license,
#Dataset
data = listings_train)
summary(model8)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates + host_is_superhost +
instant_bookable + neighbourhood_cleansed + availability_30 +
license, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.5100 -0.2168 -0.0067 0.1993 4.1306
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.031e+00 6.091e-02 82.598 < 2e-16 ***
number_of_reviews -6.854e-04 7.128e-05 -9.615 < 2e-16 ***
review_scores_rating 1.098e-01 1.195e-02 9.190 < 2e-16 ***
room_typeHotel room -9.457e-02 5.082e-02 -1.861 0.062800 .
room_typePrivate room -4.310e-01 1.124e-02 -38.361 < 2e-16 ***
room_typeShared room -2.979e-01 8.538e-02 -3.488 0.000488 ***
bedrooms 1.450e-01 7.500e-03 19.340 < 2e-16 ***
accommodates 1.060e-01 4.833e-03 21.943 < 2e-16 ***
host_is_superhostTRUE 8.767e-02 1.244e-02 7.048 1.95e-12 ***
instant_bookableTRUE 3.688e-02 9.817e-03 3.757 0.000173 ***
neighbourhood_cleansedTop 11-15 9.889e-02 1.852e-02 5.340 9.56e-08 ***
neighbourhood_cleansedTop 5 3.713e-01 1.663e-02 22.322 < 2e-16 ***
neighbourhood_cleansedTop 6-10 2.452e-01 1.682e-02 14.578 < 2e-16 ***
availability_30 1.379e-02 5.100e-04 27.047 < 2e-16 ***
licenseyes 4.356e-03 9.586e-03 0.454 0.649569
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.351 on 8348 degrees of freedom
Multiple R-squared: 0.4886, Adjusted R-squared: 0.4877
F-statistic: 569.6 on 14 and 8348 DF, p-value: < 2.2e-16
License is not a significant predictor of price when controlling for remaining variables. Let’s go to the next variable
#Model 7 + Last Review
model9 <- lm(
#Variable to predict
log(price_4_nights) ~
#Explanatory Variables
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
accommodates +
host_is_superhost +
instant_bookable +
neighbourhood_cleansed+
availability_30 +
last_review,
#Dataset
data = listings_train)
summary(model9)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates + host_is_superhost +
instant_bookable + neighbourhood_cleansed + availability_30 +
last_review, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.4853 -0.2127 -0.0085 0.1972 4.1747
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.772e+00 1.422e-01 26.527 < 2e-16 ***
number_of_reviews -7.711e-04 7.058e-05 -10.926 < 2e-16 ***
review_scores_rating 9.725e-02 1.191e-02 8.167 3.61e-16 ***
room_typeHotel room -9.117e-02 5.045e-02 -1.807 0.070777 .
room_typePrivate room -4.327e-01 1.116e-02 -38.785 < 2e-16 ***
room_typeShared room -2.818e-01 8.492e-02 -3.319 0.000909 ***
bedrooms 1.435e-01 7.436e-03 19.293 < 2e-16 ***
accommodates 1.056e-01 4.805e-03 21.968 < 2e-16 ***
host_is_superhostTRUE 6.673e-02 1.234e-02 5.407 6.60e-08 ***
instant_bookableTRUE 2.061e-02 9.889e-03 2.084 0.037160 *
neighbourhood_cleansedTop 11-15 9.869e-02 1.841e-02 5.360 8.56e-08 ***
neighbourhood_cleansedTop 5 3.695e-01 1.654e-02 22.341 < 2e-16 ***
neighbourhood_cleansedTop 6-10 2.450e-01 1.672e-02 14.653 < 2e-16 ***
availability_30 1.269e-02 5.063e-04 25.056 < 2e-16 ***
last_review 7.466e-05 7.646e-06 9.764 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.349 on 8348 degrees of freedom
Multiple R-squared: 0.4943, Adjusted R-squared: 0.4935
F-statistic: 582.9 on 14 and 8348 DF, p-value: < 2.2e-16
Last review is a significant predicotr of price and for each day of more recent reviews the prices are 0.00007 higher.
Let’s check the VIF for our final model
vif(model9) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.273594 1 1.128536
review_scores_rating 1.074327 1 1.036497
room_type 1.423269 3 1.060591
bedrooms 2.235151 1 1.495042
accommodates 2.204224 1 1.484663
host_is_superhost 1.267912 1 1.126016
instant_bookable 1.142555 1 1.068904
neighbourhood_cleansed 1.047138 3 1.007706
availability_30 1.207846 1 1.099021
last_review 1.278776 1 1.130830
Variance inflation factor is below 5 for all variables which shows no sign of autocorrelation.
Let’s also plot the model using the autoplot to check for potential issues:
autoplot(model9) For our model we see that resudual errors are evenly distributed and do not change as the variable fitted values change. This is a good sign for our model.
Futhermore we see from the theoretical quantiles that our model is approximately evenly distributed but not 100% evenly distributed.
Our final model has an r-squared of 0.494 and excludes the license categorical variable (Y/N), beds (numerical), and property type.
We now test our model for overfitting
#To get the RMSE from the training set
rmse_train <- listings_train %>%
#Calculating RMSE
mutate(predictions = predict(model9, .)) %>%
select(predictions,price_4_nights) %>%
mutate(squared_error = (predictions - log(price_4_nights))^2) %>%
summarise(rmse = sqrt(mean(squared_error))) %>%
#Printing RMSE
pull()
rmse_train[1] 0.348655
#Repeated for test set
rmse_test <- listings_test %>%
#Calculating RMSE
mutate(predictions = predict(model9, .)) %>%
select(predictions,price_4_nights) %>%
mutate(squared_error = (predictions - log(price_4_nights))^2) %>%
summarise(rmse = sqrt(mean(squared_error))) %>%
#Printing RMSE
pull()
rmse_test[1] 0.3491043
There seems to be no overfitting in our model. Let’s see if we can adjust the final model by taking out some of the less key variables without loosing explanatory power.
Specifically last review and instant bookable seemed to have little effects on the explanatory power
#Model final adjusted without instant_bookable and without last review:
model_final <- lm(
#Variable to predict
log(price_4_nights) ~
#Explanatory Variables
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
accommodates +
host_is_superhost +
neighbourhood_cleansed+
availability_30,
#Dataset
data = listings_train)
summary(model_final)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates + host_is_superhost +
neighbourhood_cleansed + availability_30, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.5161 -0.2156 -0.0072 0.1992 4.1598
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.054e+00 6.040e-02 83.669 < 2e-16 ***
number_of_reviews -6.726e-04 7.038e-05 -9.557 < 2e-16 ***
review_scores_rating 1.063e-01 1.186e-02 8.961 < 2e-16 ***
room_typeHotel room -7.094e-02 5.043e-02 -1.407 0.159563
room_typePrivate room -4.244e-01 1.110e-02 -38.235 < 2e-16 ***
room_typeShared room -3.011e-01 8.544e-02 -3.525 0.000427 ***
bedrooms 1.437e-01 7.470e-03 19.236 < 2e-16 ***
accommodates 1.072e-01 4.827e-03 22.203 < 2e-16 ***
host_is_superhostTRUE 9.250e-02 1.217e-02 7.599 3.31e-14 ***
neighbourhood_cleansedTop 11-15 9.865e-02 1.853e-02 5.323 1.05e-07 ***
neighbourhood_cleansedTop 5 3.709e-01 1.664e-02 22.287 < 2e-16 ***
neighbourhood_cleansedTop 6-10 2.436e-01 1.682e-02 14.480 < 2e-16 ***
availability_30 1.397e-02 4.942e-04 28.275 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3512 on 8350 degrees of freedom
Multiple R-squared: 0.4877, Adjusted R-squared: 0.4869
F-statistic: 662.4 on 12 and 8350 DF, p-value: < 2.2e-16
We still keep quite a high explanatory power of 0.487% so let’s see again the RMSE
#To get the RMSE from the training set
rmse_train <- listings_train %>%
#Calculating RMSE
mutate(predictions = predict(model_final, .)) %>%
select(predictions,price_4_nights) %>%
mutate(squared_error = (predictions - log(price_4_nights))^2) %>%
summarise(rmse = sqrt(mean(squared_error))) %>%
#Printing RMSE
pull()
rmse_train[1] 0.3509421
#Repeated for test set
rmse_test <- listings_test %>%
#Calculating RMSE
mutate(predictions = predict(model_final, .)) %>%
select(predictions,price_4_nights) %>%
mutate(squared_error = (predictions - log(price_4_nights))^2) %>%
summarise(rmse = sqrt(mean(squared_error))) %>%
#Printing RMSE
pull()
rmse_test[1] 0.3512415
Our RMSE is still low which is good
First let us summarize the final model:
summary(model_final)
Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating +
room_type + bedrooms + accommodates + host_is_superhost +
neighbourhood_cleansed + availability_30, data = listings_train)
Residuals:
Min 1Q Median 3Q Max
-3.5161 -0.2156 -0.0072 0.1992 4.1598
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.054e+00 6.040e-02 83.669 < 2e-16 ***
number_of_reviews -6.726e-04 7.038e-05 -9.557 < 2e-16 ***
review_scores_rating 1.063e-01 1.186e-02 8.961 < 2e-16 ***
room_typeHotel room -7.094e-02 5.043e-02 -1.407 0.159563
room_typePrivate room -4.244e-01 1.110e-02 -38.235 < 2e-16 ***
room_typeShared room -3.011e-01 8.544e-02 -3.525 0.000427 ***
bedrooms 1.437e-01 7.470e-03 19.236 < 2e-16 ***
accommodates 1.072e-01 4.827e-03 22.203 < 2e-16 ***
host_is_superhostTRUE 9.250e-02 1.217e-02 7.599 3.31e-14 ***
neighbourhood_cleansedTop 11-15 9.865e-02 1.853e-02 5.323 1.05e-07 ***
neighbourhood_cleansedTop 5 3.709e-01 1.664e-02 22.287 < 2e-16 ***
neighbourhood_cleansedTop 6-10 2.436e-01 1.682e-02 14.480 < 2e-16 ***
availability_30 1.397e-02 4.942e-04 28.275 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3512 on 8350 degrees of freedom
Multiple R-squared: 0.4877, Adjusted R-squared: 0.4869
F-statistic: 662.4 on 12 and 8350 DF, p-value: < 2.2e-16
The model has an adjusted r square and explanatory power of 48.69%.
Let’s check for issues with residuals.
autoplot(model_final)Again we see there are no issues with residuals or with normal distribution.
Let’s check VIF as well
vif(model_final) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.250398 1 1.118212
review_scores_rating 1.052602 1 1.025964
room_type 1.376315 3 1.054677
bedrooms 2.226404 1 1.492114
accommodates 2.195945 1 1.481872
host_is_superhost 1.217543 1 1.103423
neighbourhood_cleansed 1.044653 3 1.007307
availability_30 1.136085 1 1.065873
Also, again we still see no issues.
Lets plot our models
huxreg(list(
"Prop Type, Reviews, Rating" = model1,
"+ Room Type" = model2,
"+ Beds + Bedrooms + Accom." = model3_1,
"- Beds" = model3_2,
"- Prop Type" = model3_3,
" - Room Type + Prop Type" = model3_4,
" + No Reviews*Room_type" = model3_3_product,
"+ Superhost" = model4,
"+ immidiate bookings" = model5,
"+ Neighbourhoods" = model6,
"+ Availability" = model7,
"+ license" = model8,
"+ last review" = model9,
"FINAL (- instant bookable - last_review)" = model_final))| Prop Type, Reviews, Rating | + Room Type | + Beds + Bedrooms + Accom. | - Beds | - Prop Type | - Room Type + Prop Type | + No Reviews*Room_type | + Superhost | + immidiate bookings | + Neighbourhoods | + Availability | + license | + last review | FINAL (- instant bookable - last_review) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5.836 *** | 5.970 *** | 5.308 *** | 5.302 *** | 5.284 *** | 5.174 *** | 5.275 *** | 5.369 *** | 5.334 *** | 5.048 *** | 5.029 *** | 5.031 *** | 3.772 *** | 5.054 *** |
| (0.071) | (0.069) | (0.063) | (0.063) | (0.063) | (0.064) | (0.064) | (0.064) | (0.064) | (0.063) | (0.061) | (0.061) | (0.142) | (0.060) | |
| prop_type_simplifiedEntire residential home | 0.225 *** | 0.227 *** | -0.074 *** | -0.072 *** | -0.093 *** | |||||||||
| (0.021) | (0.020) | (0.019) | (0.019) | (0.019) | ||||||||||
| prop_type_simplifiedEntire townhouse | 0.292 *** | 0.295 *** | 0.006 | 0.008 | -0.014 | |||||||||
| (0.031) | (0.030) | (0.028) | (0.028) | (0.028) | ||||||||||
| prop_type_simplifiedOther | -0.070 *** | 0.239 *** | 0.148 *** | 0.149 *** | -0.060 *** | |||||||||
| (0.014) | (0.019) | (0.017) | (0.017) | (0.012) | ||||||||||
| prop_type_simplifiedPrivate room in rental unit | -0.532 *** | 0.012 | -0.037 | -0.036 | -0.404 *** | |||||||||
| (0.017) | (0.028) | (0.025) | (0.025) | (0.015) | ||||||||||
| number_of_reviews | -0.001 *** | -0.000 * | -0.000 ** | -0.000 ** | -0.000 * | -0.000 *** | 0.000 | -0.000 *** | -0.000 *** | -0.000 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| review_scores_rating | 0.111 *** | 0.082 *** | 0.111 *** | 0.111 *** | 0.116 *** | 0.131 *** | 0.116 *** | 0.097 *** | 0.103 *** | 0.110 *** | 0.110 *** | 0.110 *** | 0.097 *** | 0.106 *** |
| (0.015) | (0.014) | (0.013) | (0.013) | (0.013) | (0.013) | (0.013) | (0.013) | (0.013) | (0.012) | (0.012) | (0.012) | (0.012) | (0.012) | |
| room_typeHotel room | -0.258 *** | -0.072 | -0.072 | 0.063 | 0.125 * | 0.017 | -0.018 | -0.002 | -0.093 | -0.095 | -0.091 | -0.071 | ||
| (0.064) | (0.057) | (0.057) | (0.055) | (0.063) | (0.055) | (0.055) | (0.053) | (0.051) | (0.051) | (0.050) | (0.050) | |||
| room_typePrivate room | -0.562 *** | -0.388 *** | -0.387 *** | -0.350 *** | -0.335 *** | -0.371 *** | -0.381 *** | -0.371 *** | -0.431 *** | -0.431 *** | -0.433 *** | -0.424 *** | ||
| (0.024) | (0.021) | (0.021) | (0.012) | (0.013) | (0.012) | (0.012) | (0.012) | (0.011) | (0.011) | (0.011) | (0.011) | |||
| room_typeShared room | -0.610 *** | -0.435 *** | -0.431 *** | -0.296 ** | -0.285 * | -0.285 ** | -0.281 ** | -0.257 ** | -0.298 *** | -0.298 *** | -0.282 *** | -0.301 *** | ||
| (0.106) | (0.094) | (0.094) | (0.094) | (0.114) | (0.094) | (0.094) | (0.089) | (0.085) | (0.085) | (0.085) | (0.085) | |||
| beds | 0.009 | |||||||||||||
| (0.006) | ||||||||||||||
| bedrooms | 0.123 *** | 0.129 *** | 0.119 *** | 0.141 *** | 0.121 *** | 0.122 *** | 0.124 *** | 0.133 *** | 0.145 *** | 0.145 *** | 0.143 *** | 0.144 *** | ||
| (0.009) | (0.009) | (0.008) | (0.009) | (0.008) | (0.008) | (0.008) | (0.008) | (0.007) | (0.007) | (0.007) | (0.007) | |||
| accommodates | 0.117 *** | 0.121 *** | 0.127 *** | 0.127 *** | 0.125 *** | 0.124 *** | 0.122 *** | 0.119 *** | 0.106 *** | 0.106 *** | 0.106 *** | 0.107 *** | ||
| (0.006) | (0.005) | (0.005) | (0.005) | (0.005) | (0.005) | (0.005) | (0.005) | (0.005) | (0.005) | (0.005) | (0.005) | |||
| number_of_reviews:room_typeHotel room | -0.001 ** | |||||||||||||
| (0.000) | ||||||||||||||
| number_of_reviews:room_typePrivate room | -0.001 ** | |||||||||||||
| (0.000) | ||||||||||||||
| number_of_reviews:room_typeShared room | -0.000 | |||||||||||||
| (0.001) | ||||||||||||||
| host_is_superhostTRUE | 0.124 *** | 0.118 *** | 0.110 *** | 0.089 *** | 0.088 *** | 0.067 *** | 0.093 *** | |||||||
| (0.013) | (0.013) | (0.013) | (0.012) | (0.012) | (0.012) | (0.012) | ||||||||
| instant_bookableTRUE | 0.055 *** | 0.056 *** | 0.037 *** | 0.037 *** | 0.021 * | |||||||||
| (0.011) | (0.010) | (0.010) | (0.010) | (0.010) | ||||||||||
| neighbourhood_cleansedTop 11-15 | 0.086 *** | 0.099 *** | 0.099 *** | 0.099 *** | 0.099 *** | |||||||||
| (0.019) | (0.019) | (0.019) | (0.018) | (0.019) | ||||||||||
| neighbourhood_cleansedTop 5 | 0.366 *** | 0.371 *** | 0.371 *** | 0.369 *** | 0.371 *** | |||||||||
| (0.017) | (0.017) | (0.017) | (0.017) | (0.017) | ||||||||||
| neighbourhood_cleansedTop 6-10 | 0.227 *** | 0.245 *** | 0.245 *** | 0.245 *** | 0.244 *** | |||||||||
| (0.018) | (0.017) | (0.017) | (0.017) | (0.017) | ||||||||||
| availability_30 | 0.014 *** | 0.014 *** | 0.013 *** | 0.014 *** | ||||||||||
| (0.000) | (0.001) | (0.001) | (0.000) | |||||||||||
| licenseyes | 0.004 | |||||||||||||
| (0.010) | ||||||||||||||
| last_review | 0.000 *** | |||||||||||||
| (0.000) | ||||||||||||||
| N | 8363 | 8363 | 8363 | 8363 | 8363 | 8363 | 8363 | 8363 | 8363 | 8363 | 8363 | 8363 | 8363 | 8363 |
| R2 | 0.171 | 0.223 | 0.391 | 0.391 | 0.377 | 0.366 | 0.377 | 0.383 | 0.385 | 0.441 | 0.489 | 0.489 | 0.494 | 0.488 |
| logLik | -5123.513 | -4850.132 | -3829.296 | -3830.537 | -3930.350 | -3997.392 | -3924.011 | -3887.254 | -3874.003 | -3476.954 | -3102.201 | -3102.097 | -3054.720 | -3109.400 |
| AIC | 10263.026 | 9722.264 | 7686.591 | 7687.074 | 7878.700 | 8014.784 | 7872.023 | 7794.508 | 7770.007 | 6981.909 | 6234.401 | 6236.195 | 6141.439 | 6246.800 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||||||||||||||
We see now we have significant variables and the final model has less variables but still decent explanatory power (48.8%) and similar RMSE in the test and training set.
Interestingly, our model only predicts roughly 50% of price variations. The remaining 50% is explained by variables not included here.
50% does seem high given the final model is only using 8 different variables with some of the being categorical.
In sum our model:
We will now try to use our model to predict prices for staying for 4 nights, in a private room, with at least 10 reviews, and reviews above 4.98 (We use 4.98 to lower sample size)
First let us define our target segment
#Utilizing the full dataset
targets <- listings_4 %>%
#BY Private Room
filter(room_type == "Private room",
#At least 10 reviews
number_of_reviews >= 10,
#A score above 4.9
review_scores_rating >= 4.98) %>%
#Predicted values in log
mutate(log_predicted_values = predict(model_final, .),
#Predicted Values Nominal
prediction = exp(log_predicted_values))
targets| host_since | host_is_superhost | neighbourhood_cleansed | room_type | accommodates | price | review_scores_rating | number_of_reviews | minimum_nights | reviews_per_month | bedrooms | beds | availability_30 | last_review | license | instant_bookable | prop_type_simplified | price_4_nights | log_predicted_values | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-09-09 | FALSE | Top 5 | Private room | 2 | 245 | 5 | 10 | 2 | 0.42 | 1 | 1 | 1 | 2020-01-21 | yes | FALSE | Other | 980 | 5.9 | 364 |
| 2013-01-07 | TRUE | Top 5 | Private room | 2 | 135 | 4.98 | 51 | 2 | 0.86 | 1 | 1 | 16 | 2020-03-13 | yes | FALSE | Other | 540 | 6.17 | 478 |
| 2013-08-04 | TRUE | Top 5 | Private room | 2 | 155 | 4.99 | 70 | 3 | 1.1 | 1 | 1 | 0 | 2020-03-02 | yes | FALSE | Other | 620 | 5.93 | 378 |
| 2013-08-27 | TRUE | Top 11-15 | Private room | 2 | 45 | 4.98 | 43 | 4 | 0.44 | 1 | 1 | 0 | 2018-06-01 | yes | TRUE | Other | 180 | 5.68 | 293 |
| 2013-10-19 | TRUE | Top 5 | Private room | 2 | 89 | 5 | 11 | 2 | 0.14 | 1 | 1 | 0 | 2015-10-22 | no | FALSE | Other | 356 | 5.97 | 393 |
| 2014-06-28 | TRUE | Top 6-10 | Private room | 2 | 70 | 5 | 13 | 2 | 0.42 | 1 | 1 | 0 | 2019-05-12 | no | FALSE | Private room in rental unit | 280 | 5.85 | 346 |
| 2014-08-01 | FALSE | Top 5 | Private room | 2 | 126 | 5 | 14 | 3 | 0.53 | 1 | 1 | 7 | 2018-09-13 | yes | TRUE | Other | 504 | 5.98 | 395 |
| 2012-09-16 | TRUE | Top 5 | Private room | 4 | 125 | 4.98 | 135 | 3 | 2.07 | 1 | 2 | 18 | 2021-09-01 | yes | FALSE | Private room in rental unit | 500 | 6.36 | 575 |
| 2016-01-07 | TRUE | Top 6-10 | Private room | 2 | 75 | 4.98 | 569 | 2 | 8.36 | 1 | 1 | 2 | 2021-08-15 | yes | TRUE | Private room in rental unit | 300 | 5.5 | 244 |
| 2013-10-30 | FALSE | Top 5 | Private room | 2 | 240 | 5 | 16 | 2 | 0.26 | 1 | 1 | 0 | 2017-08-20 | yes | FALSE | Other | 960 | 5.88 | 357 |
| 2013-09-08 | TRUE | Top 5 | Private room | 2 | 80 | 5 | 11 | 3 | 0.45 | 1 | 2 | 1 | 2019-10-06 | yes | FALSE | Private room in rental unit | 320 | 5.99 | 399 |
| 2014-10-02 | FALSE | Top 11-15 | Private room | 2 | 80 | 5 | 10 | 2 | 0.27 | 1 | 1 | 0 | 2018-10-07 | no | FALSE | Private room in rental unit | 320 | 5.61 | 273 |
| 2016-05-03 | FALSE | Remaining | Private room | 2 | 47 | 5 | 16 | 2 | 0.64 | 1 | 1 | 8 | 2019-07-20 | no | FALSE | Other | 188 | 5.62 | 276 |
| 2016-07-15 | TRUE | Top 5 | Private room | 2 | 122 | 5 | 15 | 1 | 0.28 | 1 | 1 | 22 | 2019-06-30 | yes | FALSE | Other | 488 | 6.28 | 534 |
| 2015-08-11 | TRUE | Top 5 | Private room | 2 | 114 | 4.98 | 243 | 2 | 4.4 | 1 | 1 | 13 | 2021-05-02 | yes | TRUE | Other | 456 | 6 | 403 |
| 2014-09-12 | TRUE | Top 5 | Private room | 2 | 118 | 4.98 | 136 | 2 | 3.34 | 1 | 1 | 13 | 2021-08-22 | yes | FALSE | Other | 472 | 6.07 | 433 |
| 2015-04-30 | TRUE | Top 5 | Private room | 2 | 90 | 4.99 | 213 | 3 | 3.98 | 1 | 1 | 9 | 2021-07-26 | yes | TRUE | Private room in rental unit | 360 | 5.96 | 389 |
| 2016-05-21 | TRUE | Top 5 | Private room | 2 | 270 | 5 | 32 | 2 | 1.23 | 1 | 1 | 0 | 2019-10-26 | yes | FALSE | Other | 1.08e+03 | 5.96 | 388 |
| 2017-04-17 | TRUE | Top 5 | Private room | 2 | 195 | 4.99 | 181 | 2 | 3.53 | 1 | 1 | 12 | 2021-08-16 | yes | FALSE | Private room in rental unit | 780 | 6.03 | 414 |
| 2014-07-21 | FALSE | Top 5 | Private room | 2 | 121 | 5 | 21 | 2 | 0.49 | 1 | 1 | 0 | 2019-06-19 | no | FALSE | Private room in rental unit | 484 | 5.88 | 356 |
| 2014-07-29 | TRUE | Top 5 | Private room | 2 | 119 | 5 | 66 | 2 | 1.31 | 1 | 1 | 0 | 2019-10-08 | yes | TRUE | Other | 476 | 5.94 | 379 |
| 2017-01-10 | FALSE | Remaining | Private room | 2 | 80 | 5 | 12 | 2 | 0.24 | 1 | 1 | 0 | 2017-09-17 | no | FALSE | Private room in rental unit | 320 | 5.51 | 247 |
| 2014-06-03 | TRUE | Remaining | Private room | 2 | 61 | 4.98 | 111 | 2 | 2.32 | 1 | 1 | 0 | 2021-08-16 | yes | FALSE | Private room in rental unit | 244 | 5.53 | 253 |
| 2014-07-16 | FALSE | Top 6-10 | Private room | 2 | 65 | 5 | 10 | 1 | 0.21 | 1 | 1 | 0 | 2017-11-03 | no | FALSE | Private room in rental unit | 260 | 5.76 | 316 |
| 2014-01-28 | TRUE | Top 6-10 | Private room | 2 | 110 | 4.98 | 122 | 2 | 2.67 | 1 | 1 | 0 | 2020-02-29 | yes | FALSE | Private room in rental unit | 440 | 5.77 | 321 |
| 2013-04-16 | FALSE | Top 6-10 | Private room | 2 | 69 | 5 | 14 | 2 | 0.34 | 1 | 1 | 0 | 2019-03-30 | no | FALSE | Private room in rental unit | 276 | 5.75 | 315 |
| 2017-10-10 | FALSE | Top 5 | Private room | 4 | 180 | 5 | 12 | 1 | 0.26 | 2 | 2 | 0 | 2018-02-24 | no | FALSE | Private room in rental unit | 720 | 6.24 | 513 |
| 2017-12-05 | TRUE | Top 11-15 | Private room | 2 | 80 | 4.99 | 77 | 2 | 1.87 | 1 | 1 | 7 | 2019-09-13 | yes | TRUE | Private room in rental unit | 320 | 5.75 | 316 |
| 2017-12-22 | FALSE | Top 6-10 | Private room | 4 | 150 | 4.98 | 56 | 3 | 1.25 | 2 | 3 | 27 | 2019-06-10 | yes | FALSE | Other | 600 | 6.46 | 638 |
| 2012-05-11 | TRUE | Top 5 | Private room | 4 | 238 | 5 | 33 | 3 | 0.91 | 2 | 2 | 7 | 2021-07-26 | yes | FALSE | Other | 952 | 6.42 | 611 |
| 2014-07-20 | TRUE | Remaining | Private room | 2 | 125 | 5 | 10 | 1 | 0.25 | 1 | 1 | 20 | 2018-07-25 | yes | FALSE | Other | 500 | 5.88 | 359 |
| 2014-06-23 | TRUE | Top 6-10 | Private room | 2 | 80 | 4.98 | 120 | 2 | 2.9 | 1 | 1 | 0 | 2020-01-11 | yes | TRUE | Other | 320 | 5.77 | 321 |
| 2014-04-10 | FALSE | Top 5 | Private room | 2 | 77 | 5 | 24 | 2 | 0.63 | 1 | 1 | 0 | 2018-10-28 | no | FALSE | Other | 308 | 5.87 | 355 |
| 2014-10-08 | TRUE | Top 5 | Private room | 2 | 140 | 5 | 26 | 2 | 0.64 | 1 | 1 | 0 | 2020-08-25 | yes | TRUE | Other | 560 | 5.96 | 389 |
| 2013-08-22 | TRUE | Top 5 | Private room | 2 | 110 | 4.99 | 142 | 1 | 3.47 | 1 | 1 | 6 | 2019-12-26 | yes | TRUE | Private room in rental unit | 440 | 5.97 | 391 |
| 2014-11-13 | TRUE | Top 5 | Private room | 2 | 127 | 5 | 28 | 3 | 0.68 | 1 | 1 | 28 | 2019-09-04 | yes | FALSE | Other | 508 | 6.35 | 575 |
| 2016-06-14 | FALSE | Top 11-15 | Private room | 2 | 85 | 5 | 10 | 1 | 0.25 | 1 | 2 | 0 | 2019-09-18 | no | TRUE | Private room in rental unit | 340 | 5.61 | 273 |
| 2012-05-23 | FALSE | Top 6-10 | Private room | 2 | 200 | 5 | 13 | 3 | 0.32 | 1 | 1 | 27 | 2018-08-20 | no | FALSE | Private room in rental unit | 800 | 6.13 | 460 |
| 2014-04-15 | FALSE | Top 5 | Private room | 2 | 60 | 5 | 20 | 3 | 0.61 | 1 | 1 | 0 | 2019-10-21 | yes | TRUE | Other | 240 | 5.88 | 356 |
| 2015-06-15 | TRUE | Top 6-10 | Private room | 2 | 74 | 5 | 15 | 1 | 0.52 | 2 | 1 | 0 | 2019-09-22 | no | FALSE | Other | 296 | 5.99 | 399 |
| 2015-01-24 | TRUE | Top 11-15 | Private room | 2 | 139 | 5 | 12 | 1 | 0.42 | 1 | 1 | 0 | 2019-05-06 | no | TRUE | Private room in rental unit | 556 | 5.7 | 299 |
| 2015-04-02 | FALSE | Top 5 | Private room | 2 | 164 | 5 | 21 | 2 | 0.62 | 1 | 1 | 12 | 2019-12-04 | yes | FALSE | Other | 656 | 6.04 | 421 |
| 2018-08-23 | TRUE | Top 5 | Private room | 4 | 126 | 5 | 90 | 1 | 2.8 | 1 | 2 | 6 | 2021-05-14 | yes | TRUE | Other | 504 | 6.22 | 503 |
| 2015-02-12 | TRUE | Top 5 | Private room | 2 | 139 | 4.99 | 71 | 2 | 2.16 | 1 | 1 | 0 | 2019-12-02 | yes | TRUE | Other | 556 | 5.93 | 377 |
| 2018-09-05 | TRUE | Top 5 | Private room | 2 | 115 | 5 | 19 | 2 | 0.62 | 1 | 1 | 0 | 2019-06-09 | no | TRUE | Other | 460 | 5.97 | 391 |
| 2016-03-13 | FALSE | Top 6-10 | Private room | 2 | 130 | 5 | 12 | 1 | 0.38 | 1 | 1 | 30 | 2019-05-09 | no | TRUE | Private room in rental unit | 520 | 6.17 | 480 |
| 2016-06-12 | TRUE | Remaining | Private room | 2 | 92 | 4.98 | 59 | 2 | 1.81 | 1 | 1 | 19 | 2019-11-11 | yes | TRUE | Other | 368 | 5.83 | 342 |
| 2015-12-15 | TRUE | Top 6-10 | Private room | 4 | 156 | 4.99 | 98 | 3 | 3.22 | 1 | 2 | 10 | 2021-08-13 | yes | TRUE | Other | 624 | 6.14 | 465 |
| 2012-06-10 | TRUE | Top 11-15 | Private room | 2 | 120 | 4.98 | 55 | 2 | 1.73 | 1 | 1 | 23 | 2019-10-18 | yes | FALSE | Other | 480 | 5.99 | 400 |
| 2013-12-23 | TRUE | Top 11-15 | Private room | 2 | 110 | 4.98 | 57 | 2 | 1.99 | 1 | 1 | 18 | 2019-12-08 | yes | FALSE | Other | 440 | 5.92 | 373 |
| 2016-08-04 | TRUE | Top 6-10 | Private room | 2 | 98 | 4.98 | 43 | 2 | 1.32 | 1 | 1 | 20 | 2019-11-04 | yes | FALSE | Other | 392 | 6.1 | 447 |
| 2016-02-18 | TRUE | Top 11-15 | Private room | 2 | 58 | 5 | 12 | 2 | 0.39 | 1 | 1 | 0 | 2019-05-03 | no | TRUE | Private room in rental unit | 232 | 5.7 | 299 |
| 2015-05-19 | FALSE | Top 5 | Private room | 2 | 89 | 5 | 15 | 2 | 0.46 | 1 | 2 | 1 | 2021-08-31 | yes | TRUE | Private room in rental unit | 356 | 5.89 | 363 |
| 2018-12-23 | TRUE | Top 5 | Private room | 3 | 199 | 5 | 29 | 2 | 2.19 | 1 | 2 | 11 | 2021-09-03 | yes | TRUE | Other | 796 | 6.22 | 504 |
| 2014-03-09 | TRUE | Top 5 | Private room | 4 | 207 | 5 | 21 | 2 | 0.86 | 1 | 2 | 17 | 2019-08-21 | yes | TRUE | Private room in rental unit | 828 | 6.42 | 614 |
| 2019-01-26 | TRUE | Top 11-15 | Private room | 2 | 125 | 5 | 22 | 3 | 0.76 | 1 | 2 | 13 | 2019-11-16 | yes | TRUE | Other | 500 | 5.88 | 357 |
| 2014-08-17 | FALSE | Remaining | Private room | 2 | 45 | 5 | 10 | 3 | 0.32 | 1 | 1 | 0 | 2020-01-02 | no | FALSE | Other | 180 | 5.51 | 248 |
| 2015-10-13 | TRUE | Top 5 | Private room | 2 | 75 | 5 | 18 | 3 | 0.63 | 1 | 1 | 0 | 2019-10-20 | yes | FALSE | Other | 300 | 5.97 | 392 |
| 2015-07-02 | TRUE | Top 6-10 | Private room | 2 | 77 | 5 | 20 | 2 | 0.67 | 1 | 1 | 13 | 2020-02-23 | yes | FALSE | Private room in rental unit | 308 | 6.02 | 413 |
| 2012-12-28 | TRUE | Top 5 | Private room | 2 | 98 | 5 | 16 | 3 | 0.56 | 1 | 1 | 9 | 2021-07-04 | yes | TRUE | Other | 392 | 6.1 | 445 |
| 2013-09-08 | TRUE | Top 5 | Private room | 2 | 88 | 5 | 34 | 1 | 1.24 | 1 | 1 | 0 | 2020-01-27 | no | FALSE | Private room in rental unit | 352 | 5.96 | 387 |
| 2019-04-04 | TRUE | Top 6-10 | Private room | 4 | 186 | 4.98 | 58 | 2 | 2.13 | 2 | 2 | 22 | 2020-09-20 | yes | FALSE | Other | 744 | 6.48 | 651 |
| 2015-10-16 | FALSE | Top 5 | Private room | 2 | 89 | 4.98 | 45 | 2 | 1.9 | 1 | 1 | 17 | 2021-08-29 | yes | TRUE | Other | 356 | 6.09 | 444 |
| 2019-03-23 | TRUE | Top 5 | Private room | 2 | 50 | 4.98 | 40 | 1 | 1.51 | 1 | 1 | 0 | 2020-03-07 | no | FALSE | Private room in rental unit | 200 | 5.95 | 385 |
| 2019-04-10 | TRUE | Top 11-15 | Private room | 2 | 95 | 5 | 29 | 2 | 1.01 | 1 | 1 | 0 | 2020-02-17 | no | TRUE | Private room in rental unit | 380 | 5.69 | 296 |
| 2019-05-28 | TRUE | Top 11-15 | Private room | 2 | 50 | 5 | 18 | 2 | 0.75 | 1 | 1 | 0 | 2020-03-02 | no | FALSE | Private room in rental unit | 200 | 5.7 | 298 |
| 2014-08-23 | TRUE | Top 5 | Private room | 2 | 52 | 5 | 10 | 2 | 0.39 | 1 | 1 | 0 | 2019-08-12 | no | TRUE | Private room in rental unit | 208 | 5.98 | 394 |
| 2014-04-17 | TRUE | Remaining | Private room | 2 | 60 | 5 | 12 | 2 | 0.49 | 1 | 1 | 16 | 2020-02-03 | no | FALSE | Private room in rental unit | 240 | 5.83 | 339 |
| 2017-07-23 | TRUE | Remaining | Private room | 2 | 136 | 5 | 11 | 2 | 0.42 | 1 | 1 | 0 | 2019-08-09 | no | TRUE | Private room in rental unit | 544 | 5.6 | 271 |
| 2015-04-22 | TRUE | Top 5 | Private room | 2 | 110 | 5 | 18 | 2 | 0.78 | 1 | 4 | 4 | 2021-08-01 | yes | FALSE | Other | 440 | 6.03 | 414 |
| 2017-07-16 | TRUE | Top 11-15 | Private room | 4 | 92 | 5 | 37 | 2 | 1.47 | 1 | 2 | 18 | 2020-02-24 | yes | TRUE | Other | 368 | 6.15 | 469 |
| 2017-07-16 | TRUE | Top 11-15 | Private room | 4 | 91 | 4.98 | 40 | 2 | 1.55 | 1 | 2 | 17 | 2020-03-08 | yes | TRUE | Other | 364 | 6.13 | 461 |
| 2013-09-25 | TRUE | Top 11-15 | Private room | 2 | 85 | 5 | 11 | 2 | 0.46 | 1 | 1 | 0 | 2019-09-19 | no | TRUE | Other | 340 | 5.7 | 300 |
| 2015-01-02 | TRUE | Top 5 | Private room | 2 | 120 | 5 | 14 | 2 | 0.58 | 1 | 1 | 17 | 2019-11-03 | yes | FALSE | Other | 480 | 6.21 | 498 |
| 2019-07-22 | TRUE | Top 5 | Private room | 2 | 158 | 5 | 11 | 2 | 0.47 | 1 | 1 | 22 | 2019-09-15 | yes | FALSE | Private room in rental unit | 632 | 6.28 | 535 |
| 2017-01-09 | TRUE | Top 5 | Private room | 2 | 150 | 5 | 10 | 1 | 0.4 | 1 | 1 | 0 | 2019-11-24 | no | FALSE | Private room in rental unit | 600 | 5.98 | 394 |
| 2011-09-15 | FALSE | Top 5 | Private room | 2 | 80 | 5 | 15 | 1 | 0.6 | 1 | 1 | 0 | 2020-02-02 | no | FALSE | Other | 320 | 5.88 | 358 |
| 2019-08-02 | TRUE | Top 6-10 | Private room | 2 | 189 | 5 | 12 | 1 | 0.62 | 1 | 1 | 26 | 2020-02-07 | yes | TRUE | Other | 756 | 6.21 | 498 |
| 2015-06-29 | TRUE | Top 6-10 | Private room | 2 | 76 | 4.98 | 41 | 2 | 1.69 | 1 | 1 | 0 | 2020-03-10 | yes | FALSE | Other | 304 | 5.83 | 339 |
| 2013-02-26 | TRUE | Remaining | Private room | 2 | 70 | 5 | 12 | 2 | 0.51 | 1 | 1 | 0 | 2019-12-01 | yes | FALSE | Private room in rental unit | 280 | 5.6 | 271 |
| 2013-07-12 | TRUE | Top 6-10 | Private room | 2 | 160 | 5 | 15 | 2 | 1.14 | 1 | 1 | 30 | 2020-01-04 | yes | FALSE | Other | 640 | 6.26 | 525 |
| 2015-06-29 | FALSE | Top 5 | Private room | 4 | 220 | 5 | 13 | 1 | 8.48 | 2 | 2 | 0 | 2021-08-02 | yes | TRUE | Other | 880 | 6.24 | 512 |
| 2019-11-25 | FALSE | Top 5 | Private room | 2 | 99 | 5 | 12 | 1 | 3.3 | 1 | 2 | 0 | 2021-09-01 | no | TRUE | Other | 396 | 5.88 | 358 |
| 2019-12-20 | TRUE | Top 6-10 | Private room | 2 | 129 | 5 | 16 | 2 | 0.83 | 1 | 1 | 0 | 2020-03-08 | yes | FALSE | Other | 516 | 5.84 | 345 |
| 2020-01-29 | TRUE | Top 5 | Private room | 2 | 79 | 5 | 18 | 2 | 0.95 | 1 | 2 | 12 | 2021-06-05 | yes | TRUE | Other | 316 | 6.14 | 463 |
| 2020-02-02 | TRUE | Top 5 | Private room | 2 | 94 | 5 | 18 | 3 | 1.28 | 1 | 1 | 0 | 2021-09-03 | yes | FALSE | Private room in rental unit | 376 | 5.97 | 392 |
| 2020-02-04 | TRUE | Top 5 | Private room | 2 | 104 | 5 | 28 | 1 | 1.47 | 1 | 1 | 0 | 2021-07-18 | yes | TRUE | Other | 416 | 5.96 | 389 |
| 2020-02-28 | TRUE | Top 11-15 | Private room | 4 | 114 | 5 | 31 | 2 | 2.23 | 2 | 2 | 0 | 2021-07-04 | yes | FALSE | Private room in rental unit | 456 | 6.05 | 423 |
| 2020-03-09 | TRUE | Top 5 | Private room | 2 | 138 | 5 | 21 | 2 | 1.61 | 1 | 1 | 19 | 2021-08-07 | yes | TRUE | Other | 552 | 6.23 | 510 |
| 2016-01-26 | FALSE | Remaining | Private room | 2 | 54 | 5 | 10 | 2 | 10 | 1 | 1 | 14 | 2020-10-22 | yes | FALSE | Private room in rental unit | 216 | 5.71 | 301 |
| 2016-05-25 | TRUE | Top 5 | Private room | 2 | 169 | 5 | 32 | 1 | 2.55 | 1 | 2 | 20 | 2020-10-08 | yes | FALSE | Other | 676 | 6.24 | 513 |
| 2015-05-14 | TRUE | Top 5 | Private room | 4 | 137 | 5 | 14 | 2 | 4.42 | 2 | 2 | 18 | 2021-08-03 | yes | FALSE | Other | 548 | 6.58 | 722 |
| 2015-03-04 | TRUE | Top 5 | Private room | 2 | 158 | 5 | 16 | 1 | 5.58 | 1 | 1 | 13 | 2021-09-05 | yes | TRUE | Other | 632 | 6.15 | 470 |
We now have our predictions in place. Lets get the lower and upper CI. To do that we need the Sigma (Residual Standard Eroro)
#Let us take a look at our model
modelsigma <- model_final %>%
#Glance at key summary
glance() %>%
#Select Sigma
select(sigma)
modelsigma| sigma |
|---|
| 0.351 |
Our Residual STandard error is 0.35. We need to add 2x the residual standard error to get the prediction. Let’s add this to our predicted values.
#Utilizing the full dataset
targets_prediction <- targets %>%
#Adding the Sigma to a column
mutate(sigma = 0.3512151,
#Getting the Lower CI in log values
lower_ci = log_predicted_values - sigma*2,
#Getting the upper CI in log values
upper_ci = log_predicted_values + sigma*2,
#Converting Lower CI to actual values
LowerCI = exp(lower_ci),
#Converting Upper CI to actual values
UpperCI = exp(upper_ci)) %>%
#Selecting key variables
select(price_4_nights, prediction, LowerCI, UpperCI)
targets_prediction| price_4_nights | prediction | LowerCI | UpperCI |
|---|---|---|---|
| 980 | 364 | 180 | 735 |
| 540 | 478 | 237 | 965 |
| 620 | 378 | 187 | 762 |
| 180 | 293 | 145 | 591 |
| 356 | 393 | 195 | 794 |
| 280 | 346 | 171 | 698 |
| 504 | 395 | 196 | 797 |
| 500 | 575 | 285 | 1.16e+03 |
| 300 | 244 | 121 | 493 |
| 960 | 357 | 177 | 721 |
| 320 | 399 | 198 | 805 |
| 320 | 273 | 135 | 552 |
| 188 | 276 | 137 | 557 |
| 488 | 534 | 264 | 1.08e+03 |
| 456 | 403 | 199 | 813 |
| 472 | 433 | 214 | 874 |
| 360 | 389 | 193 | 785 |
| 1.08e+03 | 388 | 192 | 783 |
| 780 | 414 | 205 | 837 |
| 484 | 356 | 176 | 719 |
| 476 | 379 | 188 | 765 |
| 320 | 247 | 123 | 499 |
| 244 | 253 | 125 | 511 |
| 260 | 316 | 157 | 638 |
| 440 | 321 | 159 | 647 |
| 276 | 315 | 156 | 636 |
| 720 | 513 | 254 | 1.03e+03 |
| 320 | 316 | 156 | 637 |
| 600 | 638 | 316 | 1.29e+03 |
| 952 | 611 | 303 | 1.23e+03 |
| 500 | 359 | 178 | 725 |
| 320 | 321 | 159 | 648 |
| 308 | 355 | 176 | 718 |
| 560 | 389 | 193 | 786 |
| 440 | 391 | 194 | 790 |
| 508 | 575 | 285 | 1.16e+03 |
| 340 | 273 | 135 | 552 |
| 800 | 460 | 228 | 928 |
| 240 | 356 | 177 | 720 |
| 296 | 399 | 198 | 805 |
| 556 | 299 | 148 | 604 |
| 656 | 421 | 209 | 850 |
| 504 | 503 | 249 | 1.01e+03 |
| 556 | 377 | 187 | 762 |
| 460 | 391 | 194 | 790 |
| 520 | 480 | 238 | 969 |
| 368 | 342 | 169 | 690 |
| 624 | 465 | 230 | 939 |
| 480 | 400 | 198 | 808 |
| 440 | 373 | 185 | 752 |
| 392 | 447 | 222 | 903 |
| 232 | 299 | 148 | 604 |
| 356 | 363 | 180 | 732 |
| 796 | 504 | 250 | 1.02e+03 |
| 828 | 614 | 304 | 1.24e+03 |
| 500 | 357 | 177 | 720 |
| 180 | 248 | 123 | 500 |
| 300 | 392 | 194 | 790 |
| 308 | 413 | 205 | 833 |
| 392 | 445 | 220 | 897 |
| 352 | 387 | 192 | 782 |
| 744 | 651 | 323 | 1.32e+03 |
| 356 | 444 | 220 | 895 |
| 200 | 385 | 191 | 777 |
| 380 | 296 | 147 | 598 |
| 200 | 298 | 148 | 602 |
| 208 | 394 | 195 | 795 |
| 240 | 339 | 168 | 685 |
| 544 | 271 | 134 | 548 |
| 440 | 414 | 205 | 836 |
| 368 | 469 | 232 | 947 |
| 364 | 461 | 228 | 930 |
| 340 | 300 | 148 | 605 |
| 480 | 498 | 247 | 1e+03 |
| 632 | 535 | 265 | 1.08e+03 |
| 600 | 394 | 195 | 795 |
| 320 | 358 | 177 | 722 |
| 756 | 498 | 247 | 1e+03 |
| 304 | 339 | 168 | 684 |
| 280 | 271 | 134 | 548 |
| 640 | 525 | 260 | 1.06e+03 |
| 880 | 512 | 254 | 1.03e+03 |
| 396 | 358 | 178 | 723 |
| 516 | 345 | 171 | 697 |
| 316 | 463 | 229 | 935 |
| 376 | 392 | 194 | 790 |
| 416 | 389 | 193 | 785 |
| 456 | 423 | 209 | 854 |
| 552 | 510 | 252 | 1.03e+03 |
| 216 | 301 | 149 | 608 |
| 676 | 513 | 254 | 1.04e+03 |
| 548 | 722 | 358 | 1.46e+03 |
| 632 | 470 | 233 | 949 |
skim(targets_prediction)| Name | targets_prediction |
| Number of rows | 93 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| price_4_nights | 0 | 1 | 465.20 | 198.34 | 180.00 | 320.00 | 440.00 | 556.00 | 1080.00 | ▇▇▃▂▁ |
| prediction | 0 | 1 | 403.59 | 97.81 | 244.19 | 342.04 | 391.25 | 462.99 | 722.19 | ▅▇▃▁▁ |
| LowerCI | 0 | 1 | 199.93 | 48.45 | 120.97 | 169.44 | 193.82 | 229.36 | 357.76 | ▅▇▃▁▁ |
| UpperCI | 0 | 1 | 814.71 | 197.45 | 492.93 | 690.46 | 789.80 | 934.62 | 1457.85 | ▅▇▃▁▁ |
We also see that because of our low R squared these intervals are extremely broad. In fact the mean prediction is on average 404, however, the 95% confidence intervals are on average 200 for the lower interval and and as high as 815 for the upper interval.
Just for reference we want to plot these
targets_prediction_plot <- targets_prediction %>%
pivot_longer(cols = 2:4,
names_to = "Interval",
values_to = "price")
#Plotting the intervals density
ggplot(targets_prediction_plot, aes(x = price, fill = Interval)) +
#Predicted prices
geom_density(alpha = 0.2) +
#Simple Theme
theme_bw() +
labs(title = "Predicted prices versus confidence intervals",
x = "Predicted Price",
y = "Density") +
NULLWe see there is a great difference between our predicted price and lower and upper confidence intervals.
This shows us that the model although it explains 48% still has quite high variety in the prices.
I.e. some prices go as high as 1,500